Chapter 5: Advanced SQL Queries for Data Science
Advanced SQL techniques are essential for data science, allowing you to manipulate, filter, and analyze complex datasets efficiently. This chapter covers CASE statements, Common Table Expressions (CTEs), subqueries, window functions, and an interactive exercise to reinforce these concepts.
5.1 CASE Statements and Conditional Logic
The CASE
statement handles conditional logic in SQL, similar to IF-ELSE
statements in programming.
SELECT customer_id, age,
CASE
WHEN age < 18 THEN 'Minor'
WHEN age BETWEEN 18 AND 35 THEN 'Young Adult'
WHEN age BETWEEN 36 AND 55 THEN 'Middle-aged'
ELSE 'Senior'
END AS age_group
FROM customers;
5.2 Common Table Expressions (CTEs)
CTEs provide temporary named result sets used within a query to improve readability and modularity.
WITH TopSalaries AS (
SELECT employee_id, first_name, last_name, salary
FROM employees
ORDER BY salary DESC
LIMIT 5
)
SELECT * FROM TopSalaries;
5.3 Subqueries and Nested Queries
Subqueries allow a query to use the results of another query as input.
SELECT employee_id, first_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
e>
5.4 Window Functions (RANK, DENSE_RANK, ROW_NUMBER)
Window functions perform calculations across a set of table rows related to the current row without collapsing them.
SELECT employee_id, first_name, salary,
RANK() OVER (ORDER BY salary DESC) AS rank,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_number
FROM employees;
Employee | Salary | RANK() | DENSE_RANK() | ROW_NUMBER() |
---|---|---|---|---|
Alice | 10000 | 1 | 1 | 1 |
Bob | 9000 | 2 | 2 | 2 |
Charlie | 9000 | 2 | 2 | 3 |
David | 8000 | 4 | 3 | 4 |
5.5 Interactive Exercise: Advanced Data Transformations
Task 1: Calculate the Total Revenue Per Customer
SELECT customer_id, SUM(quantity * price) AS total_revenue
FROM sales
GROUP BY customer_id;
Task 2: Rank Customers by Total Revenue
WITH CustomerRevenue AS (
SELECT customer_id, SUM(quantity * price) AS total_revenue
FROM sales
GROUP BY customer_id
)
SELECT customer_id, total_revenue,
RANK() OVER (ORDER BY total_revenue DESC) AS revenue_rank
FROM CustomerRevenue;
Task 3: Find the Top-Selling Product Per Month
WITH MonthlySales AS (
SELECT DATE_FORMAT(order_date, '%Y-%m') AS month,
product_id,
SUM(quantity) AS total_sold
FROM sales
GROUP BY month, product_id
)
SELECT month, product_id, total_sold,
RANK() OVER (PARTITION BY month ORDER BY total_sold DESC) AS product_rank
FROM MonthlySales
WHERE product_rank = 1;
Conclusion
This chapter covered powerful advanced SQL techniques to optimize data analysis for data science:
- CASE statements for conditional logic.
- CTEs for modular queries.
- Subqueries for dynamic filtering.
- Window functions for ranking and transformations.
These concepts will help you write more efficient and expressive queries for analyzing large datasets.