Chapter 7: SQL for Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an essential step in the data science workflow, helping to uncover patterns, trends, and relationships within data. SQL provides powerful capabilities for performing EDA on structured data stored in relational databases. This chapter will focus on using SQL for EDA, including identifying trends and patterns, generating summary tables, and conducting an interactive exploratory analysis on real datasets.
7.1 Identifying Trends and Patterns with SQL
Trends and patterns in data help identify underlying insights such as seasonal variations, customer behaviors, or sales growth. SQL provides various functions for uncovering these insights.
Key SQL Techniques for Identifying Trends and Patterns
- Aggregations and Grouping: Use
GROUP BY
and aggregation functions (SUM()
,AVG()
,COUNT()
, etc.) to identify trends over time. - Rolling Averages and Moving Averages: Use WINDOW FUNCTIONS to calculate moving averages for smoothing data.
- Date-based Analysis: Use
DATE_TRUNC()
(PostgreSQL),YEAR()
,MONTH()
,DAY()
functions for trend detection. - Correlation Analysis: Identify relationships between variables using SQL queries.
Example 1: Monthly Sales Trends
SELECT
DATE_TRUNC('month', order_date) AS sales_month,
SUM(total_sales) AS total_revenue
FROM ecommerce_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY sales_month;
Example 2: Rolling 3-Month Average for Sales
SELECT
order_date,
SUM(total_sales) OVER (ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) / 3 AS rolling_avg
FROM sales_data;
Example 3: Identifying Seasonality in Orders
SELECT
MONTH(order_date) AS month,
COUNT(*) AS total_orders
FROM sales_data
GROUP BY MONTH(order_date)
ORDER BY month;
7.2 Generating Summary Tables for EDA
Summary tables help condense large datasets into meaningful statistics for better understanding.
Key SQL Techniques for Summary Tables
- Descriptive Statistics: Use
AVG()
,MIN()
,MAX()
,STDDEV()
, andVARIANCE()
for numerical insights. - Frequency Distributions: Use
COUNT()
andGROUP BY
to analyze categorical variables. - Pivot Tables: Use
CASE WHEN
orCROSS TAB
(PostgreSQL) to pivot data for better visualization. - Percentile Calculations: Use
PERCENTILE_CONT()
to calculate median or quartiles.
Example 1: Summary Table for Customer Purchases
SELECT
customer_id,
COUNT(order_id) AS total_orders,
SUM(total_sales) AS total_spent,
AVG(total_sales) AS avg_order_value
FROM ecommerce_sales
GROUP BY customer_id
ORDER BY total_spent DESC;
This summary table provides insights into customer purchasing behavior.
Example 2: Pivot Table for Sales by Category
SELECT
category,
SUM(CASE WHEN MONTH(order_date) = 1 THEN total_sales ELSE 0 END) AS Jan_sales,
SUM(CASE WHEN MONTH(order_date) = 2 THEN total_sales ELSE 0 END) AS Feb_sales,
SUM(CASE WHEN MONTH(order_date) = 3 THEN total_sales ELSE 0 END) AS Mar_sales,
SUM(total_sales) AS total_sales
FROM ecommerce_sales
GROUP BY category
ORDER BY total_sales DESC;
This query creates a pivot table to compare sales by category across years.
Example 3: Distribution of Order Values
SELECT
total_sales,
NTILE(5) OVER (ORDER BY total_sales) AS sales_quintile
FROM ecommerce_sales;
This query divides order values into five bins to observe distribution.
7.3 Interactive Exercise: Exploratory Analysis on Real Datasets
Scenario: Analyzing an E-Commerce Sales Dataset
Dataset: ecommerce_sales
Columns: order_id, customer_id, product_id, category, order_date, total_sales, quantity_sold
Task 1: Find the Top 5 Best-Selling Products
SELECT TOP 5
product_id,
SUM(quantity_sold) AS total_units_sold
FROM ecommerce_sales
GROUP BY product_id
ORDER BY total_units_sold DESC;
Task 2: Identify the Most Profitable Month
SELECT TOP 1
DATE_TRUNC('month', order_date) AS sales_month,
SUM(total_sales) AS revenue
FROM ecommerce_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY revenue DESC;
Task 3: Find the Most Loyal Customers
SELECT TOP 10
customer_id,
COUNT(order_id) AS total_orders,
SUM(total_sales) AS total_spent
FROM ecommerce_sales
GROUP BY customer_id
ORDER BY total_orders DESC;
Task 4: Detect Monthly Sales Trends
SELECT
DATETRUNC(MONTH, order_date) AS sales_month,
SUM(total_sales) AS monthly_sales
FROM ecommerce_sales
GROUP BY DATETRUNC(MONTH, order_date)
ORDER BY monthly_sales;
Task 5: Check Sales Performance by Category
SELECT
category,
SUM(total_sales) AS total_sales,
AVG(total_sales) AS avg_sales_per_order
FROM ecommerce_sales
GROUP BY category
ORDER BY total_sales DESC;
Sample Dataset
SQL Script for Database Creation
CREATE DATABASE EDADB;
USE EDADB;
-- Create table sales_data
CREATE TABLE sales_data (
sale_id INT PRIMARY KEY,
order_date DATE,
product_id INT,
category VARCHAR(50),
total_sales DECIMAL(10,2),
quantity_sold INT
);
-- Insert sample records into sales_data table
INSERT INTO sales_data (sale_id, order_date, product_id, category, total_sales, quantity_sold) VALUES
(1, '2023-01-10', 1001, 'Electronics', 500.00, 1),
(2, '2023-01-15', 1002, 'Electronics', 300.00, 2),
(3, '2023-02-01', 1003, 'Clothing', 150.00, 3),
(4, '2023-02-05', 1004, 'Home & Kitchen', 80.00, 1),
(5, '2023-03-01', 1002, 'Electronics', 600.00, 2);
-- Creating table ecommerce_sales
CREATE TABLE ecommerce_sales (
order_id INT IDENTITY(1,1) PRIMARY KEY,
customer_id INT NOT NULL,
product_id INT NOT NULL,
category VARCHAR(100),
order_date DATE NOT NULL,
total_sales DECIMAL(10,2),
quantity_sold INT
);
-- Create table products
CREATE TABLE products (
product_id INT IDENTITY(1,1) PRIMARY KEY,
product_name VARCHAR(255) NOT NULL,
category VARCHAR(100)
);
-- Create table customers
CREATE TABLE customers (
customer_id INT IDENTITY(1,1) PRIMARY KEY,
customer_name VARCHAR(255) NOT NULL,
email VARCHAR(255)
);
-- Insert sample data into products
INSERT INTO products (product_name, category) VALUES
('Laptop', 'Electronics'),
('Smartphone', 'Electronics'),
('Headphones', 'Electronics');
-- Insert sample data into customers
INSERT INTO customers (customer_name, email) VALUES
('John Doe', 'johndoe@email.com'),
('Jane Smith', 'janesmith@email.com');
-- Insert sample data into ecommerce_sales
INSERT INTO ecommerce_sales (customer_id, product_id, category, order_date, total_sales, quantity_sold) VALUES
(1, 1, 'Electronics', '2024-01-10', 1200.00, 1),
(2, 2, 'Electronics', '2024-01-15', 800.00, 1);
Conclusion
SQL is a powerful tool for performing EDA, allowing analysts to uncover key insights by identifying trends, generating summary tables, and conducting hands-on exploratory exercises. The exercises in this chapter provide a strong foundation for leveraging SQL to understand real-world datasets.