>Chapter 6: Data Cleaning and Preprocessing with SQL

Effective data analysis begins with clean and well-structured data. SQL provides powerful tools to handle missing values, convert data types, manipulate strings, remove duplicates, and manage outliers. In this chapter, we will explore how to preprocess data in SQL before analysis.

6.1 Handling Missing Data (NULLIF, COALESCE)

Understanding NULL Values

In SQL, NULL represents missing or unknown data. NULL values can cause problems in calculations and joins, so handling them is crucial.

Using NULLIF

The NULLIF function helps handle data inconsistencies by returning NULL if two given values are equal. Otherwise, it returns the first value.


SELECT employee_id, salary,
salary / NULLIF(bonus, 0) AS adjusted_salary
FROM employees;
               
  • If bonus is 0, NULLIF(bonus, 0) returns NULL, preventing a division-by-zero error.
  • If bonus is non-zero, division proceeds normally.
Using COALESCE

The COALESCE function replaces NULL values with a default or fallback value.


SELECT customer_id, COALESCE(phone_number, 'Not Provided') AS phone
FROM customers;
                

If phone_number is NULL, it is replaced with "Not Provided".

Example: Handling NULL in Calculations

SELECT order_id, COALESCE(discount, 0) AS discount_amount
FROM orders;
                

If discount is NULL, it is replaced with 0 to ensure calculations work properly.

6.2 Data Type Conversions and String Manipulations

Converting Data Types (CAST, CONVERT)

 SELECT CAST('123' AS INT) AS converted_value;
                

 SELECT CONVERT(INT, '123') AS converted_value; -- SQL Server
               

SELECT CAST('2024-03-02' AS DATE) AS formatted_date;
               
String Manipulations

SELECT UPPER(name) AS uppercase_name, LOWER(name) AS lowercase_name
FROM users;
                

SELECT TRIM('   Hello World   ') AS trimmed_string;
                

SELECT SUBSTRING(name, 1, 5) AS short_name FROM employees;
                

SELECT REPLACE(address, 'Street', 'St.') AS short_address FROM customers;
                

6.3 Removing Duplicates and Handling Outliers

Removing Duplicates (DISTINCT, ROW_NUMBER)

SELECT DISTINCT customer_id, product_id FROM orders;
                

WITH RankedOrders AS (
SELECT customer_id, product_id,
ROW_NUMBER() OVER (PARTITION BY customer_id, product_id ORDER BY order_date) AS row_num
FROM orders
)
DELETE FROM orders WHERE row_num > 1;
               
Handling Outliers

WITH stats AS (
SELECT AVG(salary) AS avg_salary, STDDEV(salary) AS std_salary FROM employees
)

SELECT * FROM employees WHERE salary > (SELECT avg_salary + 3 * std_salary FROM stats);
               

WITH Quartiles AS (
SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS Q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS Q3
FROM employees
)
SELECT * FROM employees WHERE salary < (SELECT Q1 - 1.5 * (Q3 - Q1) FROM Quartiles)
OR salary > (SELECT Q3 + 1.5 * (Q3 - Q1) FROM Quartiles);
                

6.4 Interactive Exercise: Preparing Data for Analysis

Scenario: Cleaning a Sales Dataset
Step 1: Handling Missing Data

UPDATE sales SET discount = COALESCE(discount, 0);
               
Step 2: Converting Data Types

ALTER TABLE sales ALTER COLUMN sale_date TYPE DATE;
                
Step 3: Removing Duplicates

WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY sale_id ORDER BY sale_date) AS row_num
FROM sales
)

DELETE FROM sales WHERE row_num > 1;
               
Step 4: Handling Outliers

DELETE FROM sales WHERE total_price > (SELECT AVG(total_price) + 3 * STDDEV(total_price) FROM sales);
               

Summary

  • NULLIF & COALESCE: Handle missing data by avoiding errors and providing fallback values.
  • Data Type Conversions: Use CAST and CONVERT to format data correctly.
  • String Manipulations: Clean text fields with UPPER, LOWER, TRIM, SUBSTRING, and REPLACE.
  • Removing Duplicates: Use DISTINCT and ROW_NUMBER() for duplicate handling.
  • Handling Outliers: Use statistical methods like Z-score and IQR to filter anomalies.
  • Interactive Exercise: Applied all techniques to clean a sales dataset.