Chapter 6: Exploratory Data Analysis (EDA)
Introduction
Exploratory Data Analysis (EDA) is a crucial step in the data science workflow. It involves summarizing the main characteristics of a dataset, detecting patterns, and identifying anomalies before applying machine learning models. The goal of this chapter is to equip you with techniques for analyzing and visualizing data effectively.
6.1 Techniques for Detecting Patterns and Outliers
Detecting Patterns in Data
Identifying patterns in datasets helps uncover hidden trends, correlations, and dependencies between variables. Key techniques include:
Summary Statistics:
- Measures of central tendency: Mean, median, and mode.
- Measures of dispersion: Standard deviation, variance, and interquartile range (IQR).
import pandas as pd
df = pd.read_csv("data.csv")
print(df.describe()) # Provides summary statistics
Visualization Techniques:
- Histograms: Show distribution of a single variable.
- Boxplots: Help identify outliers and distribution shape.
- Scatterplots: Reveal relationships between two variables.
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df) # Pairwise scatterplots
plt.show()
Correlation Analysis:
- Pearson correlation (linear relationships).
- Spearman correlation (monotonic relationships).
correlation_matrix = df.corr(method="pearson", numeric_only=True)  # numeric columns only
print(correlation_matrix)
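Spearman correlation is worth a side-by-side look: it captures monotonic but non-linear relationships that Pearson understates. A minimal sketch on synthetic data (the columns here are illustrative, not from data.csv):

```python
import numpy as np
import pandas as pd

# Synthetic example: y grows monotonically but non-linearly with x
x = np.arange(1, 21)
demo = pd.DataFrame({"x": x, "y": x ** 3})

pearson = demo["x"].corr(demo["y"], method="pearson")
spearman = demo["x"].corr(demo["y"], method="spearman")

# Spearman is exactly 1.0 (perfect monotonic relationship);
# Pearson is lower because the relationship is not linear
print(round(pearson, 3), round(spearman, 3))
```

A large gap between the two coefficients is itself a useful signal that the relationship is non-linear.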
Detecting Outliers
Z-Score Method:
from scipy import stats
z_scores = stats.zscore(df["column_name"])
outliers = df[abs(z_scores) > 3] # Identifies values > 3 standard deviations from the mean
print(outliers)
Interquartile Range (IQR):
Q1 = df["column_name"].quantile(0.25)
Q3 = df["column_name"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["column_name"] < (Q1 - 1.5 * IQR)) | (df["column_name"] > (Q3 + 1.5 * IQR))]
print(outliers)
Visualization of Outliers:
sns.boxplot(x=df["column_name"])
plt.show()
6.2 Data Cleaning and Preprocessing
Handling Missing Data
# Removing missing values
df.dropna(inplace=True)
# Filling missing values
df["column_name"] = df["column_name"].fillna(df["column_name"].mean())
# Imputation with scikit-learn
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
df["column_name"] = imputer.fit_transform(df[["column_name"]])
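SimpleImputer fills every gap with a single statistic. A model-based alternative is scikit-learn's KNNImputer, which fills each missing value from the rows most similar on the other features. A minimal sketch with synthetic, illustrative columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Synthetic frame with one missing value; column names are illustrative
demo = pd.DataFrame({
    "age":    [25, 30, 35, np.nan, 45],
    "income": [30000, 40000, 50000, 52000, 70000],
})

# Fill each NaN with the mean of the 2 nearest rows (by the other features)
imputer = KNNImputer(n_neighbors=2)
demo[["age", "income"]] = imputer.fit_transform(demo[["age", "income"]])

print(demo["age"].isna().sum())  # no missing values remain
```

Because the fill value depends on neighboring rows rather than a global statistic, KNN imputation preserves local structure better, at the cost of more computation.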
Handling Duplicates
df.drop_duplicates(inplace=True)
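By default drop_duplicates removes only rows that match on every column; in practice duplicates are often judged on a key column instead. A small sketch (the column names are illustrative):

```python
import pandas as pd

# Illustrative frame with a repeated customer record
demo = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "purchase":    [100, 100, 250],
})

# Exact duplicate rows
deduped = demo.drop_duplicates()

# Duplicates judged on a key column only, keeping the first occurrence
by_key = demo.drop_duplicates(subset="customer_id", keep="first")

print(len(deduped), len(by_key))
```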
Handling Inconsistent Data
# Convert categorical values to lowercase
df["category_column"] = df["category_column"].str.lower()
# Remove whitespace
df["category_column"] = df["category_column"].str.strip()
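Lowercasing and stripping whitespace handle many inconsistencies, but variant spellings usually need an explicit mapping as well. A minimal sketch (the column, values, and mapping are illustrative assumptions):

```python
import pandas as pd

# Illustrative column with inconsistent spellings and stray whitespace
demo = pd.DataFrame({"category_column": [" Yes", "yes ", "Y", "no", "N "]})

# Normalize case and whitespace first
cleaned = demo["category_column"].str.strip().str.lower()

# Then map remaining variant spellings to canonical labels
cleaned = cleaned.replace({"y": "yes", "n": "no"})

print(cleaned.tolist())
```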
6.3 Feature Engineering and Selection
Feature Engineering Techniques
# Creating new features
df["year"] = pd.to_datetime(df["date"]).dt.year
# Binning numerical values
df["income_group"] = pd.cut(df["income"], bins=[0, 30000, 60000, 100000], labels=["Low", "Medium", "High"])
Feature Selection Techniques
# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
X = df.drop(columns=["target"])
model = RandomForestClassifier()
selector = RFE(model, n_features_to_select=5)
selector.fit(X, df["target"])
print(X.columns[selector.support_])  # names of the selected features
6.4 Case Studies in EDA
Case Study 1: Analyzing Customer Churn
- Dataset: Telecom churn dataset.
- Objective: Identify factors that contribute to customer churn.
- Techniques: Summary statistics, visualizing churn rates, feature engineering, outlier detection.
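As a first cut at "which factors contribute to churn", comparing churn rates across groups is often enough to surface candidates for deeper analysis. A minimal sketch on synthetic data; the column names ("contract", "churned") are assumptions, not the actual telecom dataset schema:

```python
import pandas as pd

# Hypothetical churn frame; columns are illustrative assumptions
demo = pd.DataFrame({
    "contract": ["monthly", "monthly", "yearly", "yearly", "monthly"],
    "churned":  [1, 1, 0, 0, 0],
})

# Churn rate per contract type: a simple group-wise summary statistic
churn_rate = demo.groupby("contract")["churned"].mean()
print(churn_rate)
```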
Case Study 2: Fraud Detection in Transactions
- Dataset: Credit card transactions.
- Objective: Identify fraudulent transactions.
- Techniques: Outlier detection, PCA for dimensionality reduction, correlation heatmaps.
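The PCA step mentioned above can be sketched as follows: project the high-dimensional transaction features onto a few components so outlying transactions become easier to inspect visually. The data here is synthetic noise standing in for real transaction features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for transaction features (200 rows, 10 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Project onto 2 components for visual outlier inspection
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print(X2.shape)
```

On real fraud data, points far from the bulk of the projected cloud are candidates for closer review.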
Interactive Features
- EDA Challenge: Detect missing values, clean data, and engineer new features.
- Visualization Exercises: Generate histograms, scatter plots, and boxplots.
- Code Walkthroughs: Guided Python notebooks.
Note: to create data.csv for the code examples in Section 6.1, run the following
script. It saves the file as "data.csv"; adjust the path if you want it stored
elsewhere.
-----------------------------------------------------------------------
import pandas as pd
import numpy as np
# Generate sample data for data.csv
np.random.seed(42)
data = {
"id": np.arange(1, 101),
"age": np.random.randint(18, 65, size=100),
"income": np.random.randint(25000, 120000, size=100),
"score": np.random.randint(1, 100, size=100),
"purchase_amount": np.random.uniform(10, 500, size=100),
"category": np.random.choice(["A", "B", "C", "D"], size=100)
}
df = pd.DataFrame(data)
# Save to CSV in the working directory
df.to_csv("data.csv", index=False)
Conclusion
EDA is an essential step in any data science project, enabling us to understand data quality, detect patterns, and create meaningful features. Mastering these techniques will enhance your ability to build high-performing models.