Chapter 8: Statistical Analysis Using SQL
This chapter explores statistical computations using SQL queries, demonstrating practical methods for analyzing data.
Note:"Could you please adjust this script to ensure compatibility with PostgreSQL? It currently targets a SQL Server environment."
8.1 Computing Basic Descriptive Statistics
Descriptive statistics summarize data distribution. Key statistics include:
- Count: Number of observations
- Mean: Average of all values
- Median: Middle value of dataset
Example SQL Query:
SELECT
COUNT(sale_amount) AS count_sales,
AVG(sale_amount) AS mean_sales,
MIN(sale_amount) AS min_sales,
MAX(sale_amount) AS max_sales,
STDEV(sale_amount) AS std_dev_sales,
VAR(sale_amount) AS variance_sales
FROM sales_data;
Median Calculation:
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sale_amount) OVER() AS median_sales
FROM sales_data;
8.2 Correlation and Covariance in SQL
Correlation measures linear relationships. Covariance indicates relationship direction.
SQL Query for Pearson Correlation Coefficient:
WITH Stats AS (
SELECT AVG(sale_amount) AS avg_x, AVG(customer_spending) AS avg_y FROM sales_data
), Deviation AS (
SELECT sale_amount, customer_spending,
(sale_amount - (SELECT avg_x FROM Stats)) AS dev_x,
(customer_spending - (SELECT avg_y FROM Stats)) AS dev_y
FROM sales_data
), SquaredSums AS (
SELECT SUM(dev_x * dev_y) AS sum_xy,
SQRT(SUM(dev_x * dev_x)) AS sqrt_xx,
SQRT(SUM(dev_y * dev_y)) AS sqrt_yy
FROM Deviation
)
SELECT (sum_xy / NULLIF(sqrt_xx * sqrt_yy, 0)) AS correlation_sales_spending
FROM SquaredSums;
SQL Query for Covariance:
WITH Stats AS (
SELECT AVG(sale_amount) AS avg_x, AVG(customer_spending) AS avg_y FROM sales_data
), Deviation AS (
SELECT sale_amount, customer_spending,
(sale_amount - avg_x) AS dev_x,
(customer_spending - avg_y) AS dev_y
FROM sales_data, Stats
)
SELECT SUM(dev_x * dev_y)/COUNT(*) AS covariance_population FROM Deviation;
8.3 Implementing Linear Regression in SQL
Linear regression identifies relationships using the equation: Y = mX + b
.
SQL Query for Linear Regression Coefficients:
WITH Stats AS (
SELECT AVG(customer_spending) AS avg_y, AVG(sale_amount) AS avg_x FROM sales_data
), Deviation AS (
SELECT sale_amount, customer_spending,
(sale_amount - avg_x) AS dev_x,
(customer_spending - avg_y) AS dev_y
FROM sales_data
), Regression AS (
SELECT SUM(dev_x * dev_y) / SUM(dev_x * dev_x) AS slope,
avg_y - (SUM(dev_x * dev_y) / SUM(dev_x * dev_x) * avg_x) AS intercept
FROM Deviation, Stats
)
SELECT slope, intercept FROM Regression;
8.4 Exercise: SQL for Statistical Insights
Task 1: Compute Statistics by Department
SELECT department, COUNT(*) AS count_employees,
AVG(salary) AS avg_salary, MIN(salary) AS min_salary,
MAX(salary) AS max_salary, STDEV(salary) AS std_dev_salary
FROM employees
GROUP BY department;
Task 2: Correlation Between Salary and Experience:
WITH Stats AS (
SELECT AVG(salary) AS avg_x, AVG(experience_years) AS avg_y FROM employees
), Deviation AS (
SELECT (salary - avg_x) AS dev_x, (experience_years - avg_y) AS dev_y
FROM employees, Stats
), SquaredSums AS (
SELECT SUM(dev_x * dev_y) AS sum_xy,
SQRT(SUM(dev_x * dev_x)) AS sqrt_xx,
SQRT(SUM(dev_y * dev_y)) AS sqrt_yy
FROM Deviation
)
SELECT sum_xy / (sqrt_xx * sqrt_yy) AS salary_experience_correlation
FROM SquaredSums;
Task 3: Predict Salary increase per additional year of experience Using Linear Regression
WITH Stats AS (
SELECT
AVG(experience_years) AS avg_x, -- Independent variable (X)
AVG(salary) AS avg_y -- Dependent variable (Y)
FROM employees
),
Deviation AS (
SELECT
experience_years, salary,
(experience_years - (SELECT avg_x FROM Stats)) AS dev_x,
(salary - (SELECT avg_y FROM Stats)) AS dev_y
FROM employees
),
Regression AS (
SELECT
SUM(dev_x * dev_y) / NULLIF(SUM(dev_x * dev_x), 0) AS slope, -- REGR_SLOPE equivalent
(SELECT avg_y FROM Stats) -
((SUM(dev_x * dev_y) / NULLIF(SUM(dev_x * dev_x), 0)) * (SELECT avg_x FROM Stats)) AS intercept -- REGR_INTERCEPT equivalent
FROM Deviation
)
SELECT * FROM Regression;
Conclusion
SQL provides powerful functions to perform statistical analysis directly on databases without needing external tools. By understanding descriptive statistics, correlation, covariance, and linear regression, we can gain actionable insights from data.