Chapter 6: Exploratory Data Analysis (EDA)

Introduction

Exploratory Data Analysis (EDA) is a crucial step in the data science workflow. It involves summarizing the main characteristics of a dataset, detecting patterns, and identifying anomalies before applying machine learning models. The goal of this chapter is to equip you with techniques for analyzing and visualizing data effectively.

6.1 Techniques for Detecting Patterns and Outliers

Detecting Patterns in Data

Identifying patterns in datasets helps uncover hidden trends, correlations, and dependencies between variables. Key techniques include:

Summary Statistics:
  • Measures of central tendency: Mean, median, and mode.
  • Measures of dispersion: Standard deviation, variance, and interquartile range (IQR).
import pandas as pd

df = pd.read_csv("data.csv")
print(df.describe())  # Provides summary statistics
Visualization Techniques:
  • Histograms: Show distribution of a single variable.
  • Boxplots: Help identify outliers and distribution shape.
  • Scatterplots: Reveal relationships between two variables.
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df)  # Pairwise scatterplots
plt.show()
Correlation Analysis:
  • Pearson correlation (linear relationships).
  • Spearman correlation (monotonic relationships).
correlation_matrix = df.corr(method="pearson", numeric_only=True)  # restrict to numeric columns
print(correlation_matrix)
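Beyond printing the matrix, a heatmap makes correlations much easier to scan. Below is a minimal sketch on synthetic data; the column names and values are illustrative stand-ins, not read from data.csv:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic numeric data (illustrative column names)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(18, 65, 100),
    "income": rng.integers(25000, 120000, 100),
    "score": rng.integers(1, 100, 100),
})

corr = df.corr(method="pearson")
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```

Annotating each cell (`annot=True`) and fixing the color scale to [-1, 1] keeps heatmaps comparable across datasets.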
Detecting Outliers
Z-Score Method:
from scipy import stats

z_scores = stats.zscore(df["column_name"])
outliers = df[abs(z_scores) > 3]  # Identifies values > 3 standard deviations from the mean
print(outliers)
Interquartile Range (IQR):
Q1 = df["column_name"].quantile(0.25)
Q3 = df["column_name"].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df["column_name"] < (Q1 - 1.5 * IQR)) | (df["column_name"] > (Q3 + 1.5 * IQR))]
print(outliers)
Visualization of Outliers:
sns.boxplot(x=df["column_name"])
plt.show()

6.2 Data Cleaning and Preprocessing

Handling Missing Data
# Removing rows with missing values
df.dropna(inplace=True)

# Filling missing values (plain assignment avoids pandas chained-assignment warnings)
df["column_name"] = df["column_name"].fillna(df["column_name"].mean())

# Imputation using scikit-learn
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
df["column_name"] = imputer.fit_transform(df[["column_name"]])
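SimpleImputer fills every gap with a single statistic. For a genuinely model-based alternative, scikit-learn's KNNImputer estimates each missing value from the most similar rows. A minimal sketch on a toy frame (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing age (illustrative values)
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 40.0],
    "income": [30000.0, 35000.0, 36000.0, 50000.0],
})

imputer = KNNImputer(n_neighbors=2)  # fill with the mean of the 2 nearest rows
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

Because neighbors are found by distance on the observed columns, features on very different scales should usually be standardized before KNN imputation.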
Handling Duplicates
df.drop_duplicates(inplace=True)
Handling Inconsistent Data
# Convert categorical values to lowercase
df["category_column"] = df["category_column"].str.lower()

# Remove whitespace
df["category_column"] = df["category_column"].str.strip()
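Lowercasing and stripping handle case and spacing, but spelling variants of the same category usually need an explicit mapping. A short sketch (the category values and the mapping are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"category_column": [" USA", "usa ", "U.S.A.", "United States"]})

# Normalize case and whitespace first
df["category_column"] = df["category_column"].str.lower().str.strip()

# Then map known variants to one canonical label
canonical = {"u.s.a.": "usa", "united states": "usa"}
df["category_column"] = df["category_column"].replace(canonical)
print(df["category_column"].unique())
```

Checking `unique()` before and after the mapping is a quick way to confirm that no unexpected variants remain.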

6.3 Feature Engineering and Selection

Feature Engineering Techniques
# Creating new features (assumes the dataset has a "date" column)
df["year"] = pd.to_datetime(df["date"]).dt.year

# Binning numerical values (an open-ended top edge keeps incomes above 100000 from becoming NaN)
df["income_group"] = pd.cut(df["income"], bins=[0, 30000, 60000, float("inf")], labels=["Low", "Medium", "High"])
Feature Selection Techniques
# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
selector = RFE(model, n_features_to_select=5)
# Note: features must be numeric; encode categorical columns before fitting
selector.fit(df.drop(columns=["target"]), df["target"])
print(selector.support_)  # Boolean mask of the selected features
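The snippet above assumes a df with a "target" column. For a fully self-contained run, the same RFE workflow can be sketched on synthetic classification data (the feature names f0..f7 are made up):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 8 numeric features, 3 of them informative
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=42)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

selector = RFE(RandomForestClassifier(random_state=42), n_features_to_select=5)
selector.fit(df, y)

selected = df.columns[selector.support_].tolist()
print(selected)  # names of the 5 retained features
```

RFE repeatedly fits the estimator and drops the weakest feature each round, so with slow models it helps to raise the `step` parameter.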

6.4 Case Studies in EDA

Case Study 1: Analyzing Customer Churn
  • Dataset: Telecom churn dataset.
  • Objective: Identify factors that contribute to customer churn.
  • Techniques: Summary statistics, visualizing churn rates, feature engineering, outlier detection.
Case Study 2: Fraud Detection in Transactions
  • Dataset: Credit card transactions.
  • Objective: Identify fraudulent transactions.
  • Techniques: Outlier detection, PCA for dimensionality reduction, correlation heatmaps.
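Of the techniques listed for the fraud case study, PCA for dimensionality reduction can be sketched as follows; the random feature matrix below is only a stand-in for real transaction features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a table of 200 transactions with 10 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print(X2.shape)                       # reduced to 2 components
print(pca.explained_variance_ratio_)  # variance captured by each component
```

In practice, features are standardized before PCA, and the explained-variance ratios guide how many components to keep.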
Interactive Features
  • EDA Challenge: Detect missing values, clean data, and engineer new features.
  • Visualization Exercises: Generate histograms, scatter plots, and boxplots.
  • Code Walkthroughs: Guided Python notebooks.

    Note: to create data.csv for the examples in Section 6.1, run the
          following code; it saves "data.csv" in the current working
          directory (change file_path to save it elsewhere).
   -----------------------------------------------------------------------
    import pandas as pd
    import numpy as np

    # Generate sample data for data.csv
    np.random.seed(42)
    data = {
        "id": np.arange(1, 101),
        "age": np.random.randint(18, 65, size=100),
        "income": np.random.randint(25000, 120000, size=100),
        "score": np.random.randint(1, 100, size=100),
        "purchase_amount": np.random.uniform(10, 500, size=100),
        "category": np.random.choice(["A", "B", "C", "D"], size=100),
    }

    df = pd.DataFrame(data)

    # Save to CSV
    file_path = "data.csv"
    df.to_csv(file_path, index=False)

Conclusion

EDA is an essential step in any data science project, enabling us to understand data quality, detect patterns, and create meaningful features. Mastering these techniques will enhance your ability to build high-performing models.