Chapter 8: Statistical Analysis Using SQL

This chapter explores statistical computations using SQL queries, demonstrating practical methods for analyzing data.

Note:"Could you please adjust this script to ensure compatibility with PostgreSQL? It currently targets a SQL Server environment."

8.1 Computing Basic Descriptive Statistics

Descriptive statistics summarize data distribution. Key statistics include:

  • Count: Number of observations
  • Mean: Average of all values
  • Median: Middle value of dataset

Example SQL Query:


SELECT 
    COUNT(sale_amount) AS count_sales,
    AVG(sale_amount) AS mean_sales,
    MIN(sale_amount) AS min_sales,
    MAX(sale_amount) AS max_sales,
    STDEV(sale_amount) AS std_dev_sales,
    VAR(sale_amount) AS variance_sales
FROM sales_data;

Median Calculation:


SELECT 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sale_amount) OVER() AS median_sales
FROM sales_data;

8.2 Correlation and Covariance in SQL

Correlation measures linear relationships. Covariance indicates relationship direction.

SQL Query for Pearson Correlation Coefficient:


WITH Stats AS (
    SELECT AVG(sale_amount) AS avg_x, AVG(customer_spending) AS avg_y FROM sales_data
), Deviation AS (
    SELECT sale_amount, customer_spending,
           (sale_amount - (SELECT avg_x FROM Stats)) AS dev_x,
           (customer_spending - (SELECT avg_y FROM Stats)) AS dev_y
    FROM sales_data
), SquaredSums AS (
    SELECT SUM(dev_x * dev_y) AS sum_xy,
           SQRT(SUM(dev_x * dev_x)) AS sqrt_xx,
           SQRT(SUM(dev_y * dev_y)) AS sqrt_yy
    FROM Deviation
)
SELECT (sum_xy / NULLIF(sqrt_xx * sqrt_yy, 0)) AS correlation_sales_spending
FROM SquaredSums;

SQL Query for Covariance:


WITH Stats AS (
    SELECT AVG(sale_amount) AS avg_x, AVG(customer_spending) AS avg_y FROM sales_data
), Deviation AS (
    SELECT sale_amount, customer_spending,
           (sale_amount - avg_x) AS dev_x,
           (customer_spending - avg_y) AS dev_y
    FROM sales_data, Stats
)
SELECT SUM(dev_x * dev_y)/COUNT(*) AS covariance_population FROM Deviation;

8.3 Implementing Linear Regression in SQL

Linear regression identifies relationships using the equation: Y = mX + b.

SQL Query for Linear Regression Coefficients:


WITH Stats AS (
    SELECT AVG(customer_spending) AS avg_y, AVG(sale_amount) AS avg_x FROM sales_data
), Deviation AS (
    SELECT sale_amount, customer_spending,
           (sale_amount - avg_x) AS dev_x,
           (customer_spending - avg_y) AS dev_y
    FROM sales_data
), Regression AS (
    SELECT SUM(dev_x * dev_y) / SUM(dev_x * dev_x) AS slope,
           avg_y - (SUM(dev_x * dev_y) / SUM(dev_x * dev_x) * avg_x) AS intercept
    FROM Deviation, Stats
)
SELECT slope, intercept FROM Regression;

8.4 Exercise: SQL for Statistical Insights

Task 1: Compute Statistics by Department


SELECT department, COUNT(*) AS count_employees,
       AVG(salary) AS avg_salary, MIN(salary) AS min_salary,
       MAX(salary) AS max_salary, STDEV(salary) AS std_dev_salary
FROM employees
GROUP BY department;

Task 2: Correlation Between Salary and Experience:


WITH Stats AS (
    SELECT AVG(salary) AS avg_x, AVG(experience_years) AS avg_y FROM employees
), Deviation AS (
    SELECT (salary - avg_x) AS dev_x, (experience_years - avg_y) AS dev_y
    FROM employees, Stats
), SquaredSums AS (
    SELECT SUM(dev_x * dev_y) AS sum_xy,
           SQRT(SUM(dev_x * dev_x)) AS sqrt_xx,
           SQRT(SUM(dev_y * dev_y)) AS sqrt_yy
    FROM Deviation
)
SELECT sum_xy / (sqrt_xx * sqrt_yy) AS salary_experience_correlation
FROM SquaredSums;

Task 3: Predict Salary increase per additional year of experience Using Linear Regression


WITH Stats AS (
SELECT
AVG(experience_years) AS avg_x,  -- Independent variable (X)
AVG(salary) AS avg_y  -- Dependent variable (Y)
FROM employees
),
Deviation AS (
SELECT
experience_years, salary,
(experience_years - (SELECT avg_x FROM Stats)) AS dev_x,
(salary - (SELECT avg_y FROM Stats)) AS dev_y
FROM employees
),
Regression AS (
SELECT
SUM(dev_x * dev_y) / NULLIF(SUM(dev_x * dev_x), 0) AS slope,  -- REGR_SLOPE equivalent
(SELECT avg_y FROM Stats) -
((SUM(dev_x * dev_y) / NULLIF(SUM(dev_x * dev_x), 0)) * (SELECT avg_x FROM Stats)) AS intercept -- REGR_INTERCEPT equivalent
FROM Deviation
)
SELECT * FROM Regression;

Conclusion

SQL provides powerful functions to perform statistical analysis directly on databases without needing external tools. By understanding descriptive statistics, correlation, covariance, and linear regression, we can gain actionable insights from data.