Chapter 2: Fundamentals of Descriptive Statistics

I. Measures of Central Tendency

1. Mean

Definition: The mean is the arithmetic average of a set of numbers. It is calculated by summing all the observations and dividing by the number of observations.

Formula:

$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example: Consider the dataset: 4, 8, 15, 16, 23, 42.

  • Step 1: Sum the numbers: \(4 + 8 + 15 + 16 + 23 + 42 = 108\)
  • Step 2: Count the observations: 6 numbers
  • Step 3: Calculate the mean: \(\frac{108}{6} = 18\)

Usage and Considerations: The mean uses every observation, so it is sensitive to extreme values (outliers). It is most informative when the data distribution is roughly symmetrical.
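
The three steps of the example can be sketched in Python. This is a minimal illustration using only the standard library; the variable names are chosen here for readability and are not part of the original text.

```python
from statistics import mean

data = [4, 8, 15, 16, 23, 42]

total = sum(data)            # Step 1: sum the observations
count = len(data)            # Step 2: count the observations
manual_mean = total / count  # Step 3: divide the sum by the count

# The standard library's mean() should agree with the manual calculation.
library_mean = mean(data)

print(total, count, manual_mean, library_mean)
```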

2. Median

Definition: The median is the middle value of an ordered dataset, dividing the dataset into two halves.

Procedure:

  1. Order the Data: Arrange the values from smallest to largest.
  2. Determine the Middle:
    • If the number of observations is odd, the median is the middle number.
    • If even, the median is the average of the two middle numbers.

Example (Odd Number of Observations): Dataset: 3, 7, 9, 12, 15. The middle value is 9.

Example (Even Number of Observations): Dataset: 3, 7, 9, 12. The two middle values are 7 and 9. Median: \( \frac{7+9}{2} = 8 \)

Usage and Considerations: The median is robust to outliers and is particularly useful in skewed distributions.
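
The procedure above (order the data, then pick or average the middle) can be sketched as follows. The helper name `median_by_hand` is hypothetical; the standard library's `statistics.median` is shown alongside it as a cross-check.

```python
from statistics import median


def median_by_hand(values):
    """Return the median by sorting and inspecting the middle position(s)."""
    ordered = sorted(values)  # Step 1: order the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value is the median.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median_by_hand([3, 7, 9, 12, 15])
even_result = median_by_hand([3, 7, 9, 12])

print(odd_result, even_result)
print(median([3, 7, 9, 12, 15]), median([3, 7, 9, 12]))
```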

3. Mode

Definition: The mode is the most frequently occurring value in a dataset.

Example 1 (Single Mode): Dataset: 2, 4, 4, 6, 8. Mode: 4.

Example 2 (Bimodal): Dataset: 5, 5, 7, 8, 8, 9. Modes: 5 and 8.

Usage and Considerations: The mode is the only one of these three measures that can be used with nominal data, and it is less affected by outliers than the mean.
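
A short sketch of finding modes with the standard library, assuming Python 3.8 or later for `statistics.multimode`, which returns every most-frequent value as a list. The colour list is a made-up example of nominal data, not taken from the text.

```python
from statistics import multimode

single = multimode([2, 4, 4, 6, 8])      # one value occurs most often
bimodal = multimode([5, 5, 7, 8, 8, 9])  # two values tie for most frequent

# Hypothetical nominal data: the mode works even though the values
# cannot be summed or ordered meaningfully.
colours = ["red", "blue", "red", "green"]
nominal = multimode(colours)

print(single, bimodal, nominal)
```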

Summary

  • Mean: Provides the arithmetic average but is influenced by outliers.
  • Median: Represents the middle value and is robust to outliers.
  • Mode: Represents the most frequent value and is applicable to categorical data.
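
To illustrate the contrast between the mean and the median, the sketch below appends one extreme value to the dataset from the mean example. The value 1000 is a hypothetical outlier chosen only for illustration.

```python
from statistics import mean, median

data = [4, 8, 15, 16, 23, 42]
with_outlier = data + [1000]  # hypothetical extreme value

mean_before, mean_after = mean(data), mean(with_outlier)
median_before, median_after = median(data), median(with_outlier)

# The mean shifts substantially; the median barely moves.
print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```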

II. Measures of Dispersion (Variability)

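Measures of dispersion describe the spread or variability of the data. Some, such as the variance and standard deviation, quantify how far the values lie from the mean; others, such as the range and interquartile range, measure the distance between positions in the ordered data.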

1. Range

Definition: The range is the difference between the largest and smallest values in a dataset.

Formula: \( \text{Range} = \text{Max} - \text{Min} \)

Example:
Data: 3, 7, 8, 15
\( \text{Range} = 15 - 3 = 12 \)
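
The same calculation as a one-line sketch in Python, using the built-in `max` and `min` functions:

```python
data = [3, 7, 8, 15]

# Range = Max - Min
data_range = max(data) - min(data)

print(data_range)
```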

2. Variance

Definition: Variance measures the average squared deviation of each data point from the mean, providing insight into the data's spread.

Formulas:

  • Population: \( \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \)
  • Sample: \( s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \)

Example:
Data: 4, 8, 10
Calculate the mean: \( \bar{x} = \frac{4+8+10}{3} \approx 7.33 \)

Compute squared deviations:
\( (4 - 7.33)^2 \approx 11.11 \)
\( (8 - 7.33)^2 \approx 0.44 \)
\( (10 - 7.33)^2 \approx 7.11 \)
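Sum of squared deviations: \( \frac{100}{9} + \frac{4}{9} + \frac{64}{9} = \frac{168}{9} \approx 18.67 \) (the rounded terms above add to 18.66; the difference is rounding error)

Population Variance: \( \sigma^2 \approx \frac{18.67}{3} \approx 6.22 \)
Sample Variance: \( s^2 \approx \frac{18.67}{2} \approx 9.33 \)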
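
The worked example can be checked with a short sketch that follows the formulas step by step and then compares the results with the standard library's `pvariance` (population) and `variance` (sample). Variable names are illustrative.

```python
from statistics import pvariance, variance

data = [4, 8, 10]
n = len(data)

x_bar = sum(data) / n                                  # mean of the data
squared_deviations = [(x - x_bar) ** 2 for x in data]  # (x_i - mean)^2
sum_sq = sum(squared_deviations)

population_variance = sum_sq / n     # divide by N
sample_variance = sum_sq / (n - 1)   # divide by n - 1

print(round(population_variance, 2), round(sample_variance, 2))
print(round(pvariance(data), 2), round(variance(data), 2))
```

Working with unrounded intermediate values, as the code does, avoids the small rounding drift that appears when the squared deviations are rounded to two decimals before summing.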

3. Standard Deviation

Definition: The standard deviation is the square root of the variance and is expressed in the same units as the data.

Formula: \( \text{Standard Deviation} = \sqrt{\text{Variance}} \)

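Example (continuing from the variance example, data: 4, 8, 10):
Population standard deviation, from \( \sigma^2 = \frac{56}{9} \approx 6.22 \): \( \sigma = \sqrt{\frac{56}{9}} \approx 2.49 \)
Sample standard deviation, from \( s^2 = \frac{28}{3} \approx 9.33 \): \( s = \sqrt{\frac{28}{3}} \approx 3.06 \)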
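
A minimal sketch of the square-root relationship, reusing the dataset 4, 8, 10 from the variance example. It takes the square root of each variance and compares the result with the standard library's `pstdev` and `stdev`.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [4, 8, 10]

# Standard deviation = square root of the variance.
population_sd = math.sqrt(pvariance(data))  # about 2.494
sample_sd = math.sqrt(variance(data))       # about 3.055

# The dedicated library functions should give the same values.
print(math.isclose(population_sd, pstdev(data)))
print(math.isclose(sample_sd, stdev(data)))
```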

4. Percentiles

Definition: Percentiles divide the ordered data into 100 equal parts. The p-th percentile is the value below which p percent of the observations fall.

Example:
Data: 2, 4, 6
The 50th percentile (median) is 4.
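
Several conventions exist for computing percentiles on small datasets, and they can give different answers. The sketch below uses the nearest-rank method, which is an assumption made here for illustration and is not prescribed by the text; the function name `percentile_nearest_rank` is hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile using the nearest-rank convention."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Rank is the smallest position covering at least p percent of the data.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


data = [2, 4, 6]
p50 = percentile_nearest_rank(data, 50)

print(p50)
```

Interpolating methods, such as those offered by statistical libraries, may return values that do not appear in the dataset.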

5. Quartiles

Definition: Quartiles divide data into four equal parts:

  • Q1 (First Quartile): 25th percentile
  • Q2 (Second Quartile/Median): 50th percentile
  • Q3 (Third Quartile): 75th percentile

Interquartile Range (IQR):
\( \text{IQR} = Q3 - Q1 \)

Example:
Data: 5, 7, 8, 12, 15, 18, 22
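Q1 = 7, Median = 12, Q3 = 18 (here Q1 and Q3 are the medians of the lower half 5, 7, 8 and the upper half 15, 18, 22, with the overall median excluded; other conventions can give slightly different quartiles)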
\( \text{IQR} = 18 - 7 = 11 \)
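
The quartiles in this example can be reproduced with a short sketch. It assumes the median-of-halves convention (the overall median is left out of both halves when the count is odd) and at least two observations; the helper name `quartiles_by_halves` is hypothetical. The standard library's `statistics.quantiles` (Python 3.8+), with its default method, is shown for comparison.

```python
from statistics import median, quantiles


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle element so it sits in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [5, 7, 8, 12, 15, 18, 22]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)
print(quantiles(data, n=4))  # library cut points for comparison
```

The two approaches agree on this dataset, but they are not guaranteed to agree in general, so it is worth stating which convention is in use when reporting quartiles.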