Chapter 5: Advanced SQL Queries for Data Science

Advanced SQL techniques are essential for data science, allowing you to manipulate, filter, and analyze complex datasets efficiently. This chapter covers CASE statements, Common Table Expressions (CTEs), subqueries, window functions, and an interactive exercise to reinforce these concepts.

5.1 CASE Statements and Conditional Logic

The CASE statement handles conditional logic in SQL, similar to IF-ELSE statements in programming.


SELECT customer_id, age,
       CASE 
           WHEN age < 18 THEN 'Minor'
           WHEN age BETWEEN 18 AND 35 THEN 'Young Adult'
           WHEN age BETWEEN 36 AND 55 THEN 'Middle-aged'
           ELSE 'Senior'
       END AS age_group
FROM customers;
 

5.2 Common Table Expressions (CTEs)

CTEs provide temporary named result sets used within a query to improve readability and modularity.


WITH TopSalaries AS (
    SELECT employee_id, first_name, last_name, salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 5
)
SELECT * FROM TopSalaries;

5.3 Subqueries and Nested Queries

Subqueries allow a query to use the results of another query as input.


SELECT employee_id, first_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
e>

5.4 Window Functions (RANK, DENSE_RANK, ROW_NUMBER)

Window functions perform calculations across a set of table rows related to the current row without collapsing them.


SELECT employee_id, first_name, salary,
       RANK() OVER (ORDER BY salary DESC) AS rank,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_number
FROM employees;
Employee Salary RANK() DENSE_RANK() ROW_NUMBER()
Alice 10000 1 1 1
Bob 9000 2 2 2
Charlie 9000 2 2 3
David 8000 4 3 4

5.5 Interactive Exercise: Advanced Data Transformations

Task 1: Calculate the Total Revenue Per Customer

SELECT customer_id, SUM(quantity * price) AS total_revenue
FROM sales
GROUP BY customer_id;
Task 2: Rank Customers by Total Revenue

WITH CustomerRevenue AS (
    SELECT customer_id, SUM(quantity * price) AS total_revenue
    FROM sales
    GROUP BY customer_id
)
SELECT customer_id, total_revenue,
       RANK() OVER (ORDER BY total_revenue DESC) AS revenue_rank
FROM CustomerRevenue;
Task 3: Find the Top-Selling Product Per Month

WITH MonthlySales AS (
    SELECT DATE_FORMAT(order_date, '%Y-%m') AS month, 
           product_id, 
           SUM(quantity) AS total_sold
    FROM sales
    GROUP BY month, product_id
)
SELECT month, product_id, total_sold,
       RANK() OVER (PARTITION BY month ORDER BY total_sold DESC) AS product_rank
FROM MonthlySales
WHERE product_rank = 1;

Conclusion

This chapter covered powerful advanced SQL techniques to optimize data analysis for data science:

  • CASE statements for conditional logic.
  • CTEs for modular queries.
  • Subqueries for dynamic filtering.
  • Window functions for ranking and transformations.

These concepts will help you write more efficient and expressive queries for analyzing large datasets.