Chapter 7: SQL for Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an essential step in the data science workflow, helping to uncover patterns, trends, and relationships within data. SQL provides powerful capabilities for performing EDA on structured data stored in relational databases. This chapter will focus on using SQL for EDA, including identifying trends and patterns, generating summary tables, and conducting an interactive exploratory analysis on real datasets.

7.1 Identifying Trends and Patterns with SQL

Trends and patterns in data help identify underlying insights such as seasonal variations, customer behaviors, or sales growth. SQL provides various functions for uncovering these insights.

Key SQL Techniques for Identifying Trends and Patterns
  • Aggregations and Grouping: Use GROUP BY and aggregation functions (SUM(), AVG(), COUNT(), etc.) to identify trends over time.
  • Rolling Averages and Moving Averages: Use WINDOW FUNCTIONS to calculate moving averages for smoothing data.
  • Date-based Analysis: Use DATE_TRUNC() (PostgreSQL), YEAR(), MONTH(), DAY() functions for trend detection.
  • Correlation Analysis: Identify relationships between variables using SQL queries.

Example 1: Monthly Sales Trends


SELECT 
    DATE_TRUNC('month', order_date) AS sales_month, 
    SUM(total_sales) AS total_revenue
FROM ecommerce_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY sales_month;
    

Example 2: Rolling 3-Month Average for Sales


SELECT 
    order_date, 
    SUM(total_sales) OVER (ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) / 3 AS rolling_avg
FROM sales_data;
    

Example 3: Identifying Seasonality in Orders


SELECT 
    MONTH(order_date) AS month, 
    COUNT(*) AS total_orders
FROM sales_data
GROUP BY MONTH(order_date)
ORDER BY month;
    

7.2 Generating Summary Tables for EDA

Summary tables help condense large datasets into meaningful statistics for better understanding.

Key SQL Techniques for Summary Tables
  • Descriptive Statistics: Use AVG(), MIN(), MAX(), STDDEV(), and VARIANCE() for numerical insights.
  • Frequency Distributions: Use COUNT() and GROUP BY to analyze categorical variables.
  • Pivot Tables: Use CASE WHEN or CROSS TAB (PostgreSQL) to pivot data for better visualization.
  • Percentile Calculations: Use PERCENTILE_CONT() to calculate median or quartiles.

Example 1: Summary Table for Customer Purchases


SELECT 
    customer_id, 
    COUNT(order_id) AS total_orders, 
    SUM(total_sales) AS total_spent, 
    AVG(total_sales) AS avg_order_value
FROM ecommerce_sales
GROUP BY customer_id
ORDER BY total_spent DESC;
    

This summary table provides insights into customer purchasing behavior.

Example 2: Pivot Table for Sales by Category


SELECT
category,
SUM(CASE WHEN MONTH(order_date) = 1 THEN total_sales ELSE 0 END) AS Jan_sales,
SUM(CASE WHEN MONTH(order_date) = 2 THEN total_sales ELSE 0 END) AS Feb_sales,
SUM(CASE WHEN MONTH(order_date) = 3 THEN total_sales ELSE 0 END) AS Mar_sales,
SUM(total_sales) AS total_sales
FROM ecommerce_sales
GROUP BY category
ORDER BY total_sales DESC;
    

This query creates a pivot table to compare sales by category across years.

Example 3: Distribution of Order Values


SELECT
total_sales,
NTILE(5) OVER (ORDER BY total_sales) AS sales_quintile
FROM ecommerce_sales;
                     

This query divides order values into five bins to observe distribution.

7.3 Interactive Exercise: Exploratory Analysis on Real Datasets

Scenario: Analyzing an E-Commerce Sales Dataset

Dataset: ecommerce_sales

Columns: order_id, customer_id, product_id, category, order_date, total_sales, quantity_sold

Task 1: Find the Top 5 Best-Selling Products

SELECT TOP 5
    product_id,
    SUM(quantity_sold) AS total_units_sold
FROM ecommerce_sales
GROUP BY product_id
ORDER BY total_units_sold DESC;
Task 2: Identify the Most Profitable Month

SELECT TOP 1
    DATE_TRUNC('month', order_date) AS sales_month,
    SUM(total_sales) AS revenue
FROM ecommerce_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY revenue DESC;
    
Task 3: Find the Most Loyal Customers

SELECT TOP 10
customer_id,
COUNT(order_id) AS total_orders,
SUM(total_sales) AS total_spent
FROM ecommerce_sales
GROUP BY customer_id
ORDER BY total_orders DESC;
  
Task 4: Detect Monthly Sales Trends

SELECT
DATETRUNC(MONTH, order_date) AS sales_month,
SUM(total_sales) AS monthly_sales
FROM ecommerce_sales
GROUP BY DATETRUNC(MONTH, order_date)
ORDER BY monthly_sales;
  
Task 5: Check Sales Performance by Category

SELECT
category,
SUM(total_sales) AS total_sales,
AVG(total_sales) AS avg_sales_per_order
FROM ecommerce_sales
GROUP BY category
ORDER BY total_sales DESC;
                      

Sample Dataset

SQL Script for Database Creation

CREATE DATABASE EDADB;

USE EDADB;

-- Create table sales_data
CREATE TABLE sales_data (
    sale_id INT PRIMARY KEY,
    order_date DATE,
    product_id INT,
    category VARCHAR(50),
    total_sales DECIMAL(10,2),
    quantity_sold INT
);

-- Insert sample records into sales_data table
INSERT INTO sales_data (sale_id, order_date, product_id, category, total_sales, quantity_sold) VALUES
(1, '2023-01-10', 1001, 'Electronics', 500.00, 1),
(2, '2023-01-15', 1002, 'Electronics', 300.00, 2),
(3, '2023-02-01', 1003, 'Clothing', 150.00, 3),
(4, '2023-02-05', 1004, 'Home & Kitchen', 80.00, 1),
(5, '2023-03-01', 1002, 'Electronics', 600.00, 2);

-- Creating table ecommerce_sales
CREATE TABLE ecommerce_sales (
    order_id INT IDENTITY(1,1) PRIMARY KEY,
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    category VARCHAR(100),
    order_date DATE NOT NULL,
    total_sales DECIMAL(10,2),
    quantity_sold INT
);

-- Create table products
CREATE TABLE products (
    product_id INT IDENTITY(1,1) PRIMARY KEY,
    product_name VARCHAR(255) NOT NULL,
    category VARCHAR(100)
);

-- Create table customers
CREATE TABLE customers (
    customer_id INT IDENTITY(1,1) PRIMARY KEY,
    customer_name VARCHAR(255) NOT NULL,
    email VARCHAR(255)
);

-- Insert sample data into products
INSERT INTO products (product_name, category) VALUES
('Laptop', 'Electronics'),
('Smartphone', 'Electronics'),
('Headphones', 'Electronics');

-- Insert sample data into customers
INSERT INTO customers (customer_name, email) VALUES
('John Doe', 'johndoe@email.com'),
('Jane Smith', 'janesmith@email.com');

-- Insert sample data into ecommerce_sales
INSERT INTO ecommerce_sales (customer_id, product_id, category, order_date, total_sales, quantity_sold) VALUES
(1, 1, 'Electronics', '2024-01-10', 1200.00, 1),
(2, 2, 'Electronics', '2024-01-15', 800.00, 1);
  

Conclusion

SQL is a powerful tool for performing EDA, allowing analysts to uncover key insights by identifying trends, generating summary tables, and conducting hands-on exploratory exercises. The exercises in this chapter provide a strong foundation for leveraging SQL to understand real-world datasets.