Scikit-Learn: A Comprehensive Guide with Examples

1. Installation and Setup

Ensure you have a currently supported Python 3 release installed; recent scikit-learn versions no longer support Python 3.7. Install Scikit-Learn with pip:

pip install scikit-learn

Or using Anaconda:

conda install -c conda-forge scikit-learn

Verify installation:

import sklearn
print(sklearn.__version__)

2. Core Features of Scikit-Learn

  • Supervised Learning: Classification and regression
  • Unsupervised Learning: Clustering, dimensionality reduction
  • Model Selection & Hyperparameter Tuning
  • Preprocessing & Feature Engineering
  • Evaluation Metrics
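
The feature groups above share one estimator interface, which a minimal sketch can illustrate: every estimator learns from data with fit, transformers add transform, and predictors add predict and score. The choice of StandardScaler and LogisticRegression, and all variable names, are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Transformer: fit learns per-feature mean and variance, transform applies them
scaler_demo = StandardScaler().fit(X_demo)
X_demo_scaled = scaler_demo.transform(X_demo)

# Predictor: fit learns the model, predict returns labels, score returns accuracy
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo_scaled, y_demo)
labels_demo = clf_demo.predict(X_demo_scaled[:5])
train_accuracy = clf_demo.score(X_demo_scaled, y_demo)
print(labels_demo, f"{train_accuracy:.2f}")
```

Because the interface is uniform, swapping in a different classifier or scaler usually changes only the constructor line.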

3. Loading Datasets

from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()

# Print dataset keys
print(iris.keys())

# Extract features and target labels
X, y = iris.data, iris.target
print(X.shape, y.shape)

4. Data Preprocessing

4.1 Handling Missing Values

from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values
data = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Use mean strategy to fill missing values
imputer = SimpleImputer(strategy="mean")
data_imputed = imputer.fit_transform(data)
print(data_imputed)
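
As a small follow-up sketch, the fitted imputer stores the per-column means it learned, so the same values can later fill gaps in new rows. The names mean_imputer, train_rows and new_rows are hypothetical and used only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_rows = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
mean_imputer = SimpleImputer(strategy="mean").fit(train_rows)

# Column means computed from the non-missing training values: 4, 5 and 7.5
print(mean_imputer.statistics_)

# New rows are filled with the stored training means, not with their own statistics
new_rows = np.array([[np.nan, 3.0, np.nan]])
filled = mean_imputer.transform(new_rows)
print(filled)
```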

4.2 Feature Scaling (Standardization & Normalization)

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Min-Max Normalization
min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)
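
When a held-out test set is involved, a common practice is to fit the scaler on the training rows only and reuse those statistics on the test rows, so that no information from the test set leaks into preprocessing. The sketch below assumes the iris data and an 80/20 split; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Learn mean and standard deviation from the training rows only
train_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = train_scaler.transform(X_tr)

# Reuse the training statistics on the test rows
X_te_scaled = train_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(2))  # zero per feature, up to rounding
print(X_te_scaled.mean(axis=0).round(2))  # near zero, but generally not exactly zero
```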

5. Supervised Learning

5.1 Classification Example (Iris Dataset)

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Evaluate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
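
A fitted random forest also exposes impurity-based feature importances, which can give a rough idea of which inputs drive the predictions. The following is a minimal sketch with illustrative names; the importances are a heuristic rather than a definitive ranking.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris_data.data, iris_data.target)

# One importance per feature; the values sum to 1
ranked = sorted(
    zip(iris_data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```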

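5.2 Regression Example (Diabetes Dataset)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load dataset (load_boston was removed in scikit-learn 1.2,
# so the bundled diabetes dataset is used here instead)
diabetes = datasets.load_diabetes()
# Note: X and y now refer to the regression data, not iris
X, y = diabetes.data, diabetes.target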

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

6. Unsupervised Learning

6.1 Clustering (K-Means)

from sklearn.cluster import KMeans

# Fit K-Means with 3 clusters on the iris features
# (X was rebound to the regression data in section 5.2)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(iris.data)

# Print cluster centers
print(kmeans.cluster_centers_)
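
The number of clusters is a choice rather than something K-Means learns. One common heuristic, sketched here under the assumption that the iris features are the data of interest, is to compare silhouette scores across several candidate values of k; the names below are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X_iris, _ = load_iris(return_X_y=True)

silhouettes = {}
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_iris)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    silhouettes[k] = silhouette_score(X_iris, labels_k)
    print(f"k={k}: silhouette={silhouettes[k]:.3f}")
```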

6.2 Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

# Reduce the iris features to 2 dimensions
# (X was rebound to the regression data in section 5.2)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris.data)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
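
Instead of fixing the number of components, PCA also accepts a float between 0 and 1 as n_components and then keeps as many components as are needed to explain that share of the variance. This sketch assumes standardized iris features and a 95% target, both chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris_pca, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X_iris_pca)

# Keep enough components to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_std)

print(f"Components kept: {pca_95.n_components_}")
print(f"Cumulative variance: {pca_95.explained_variance_ratio_.cumsum()}")
```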

7. Model Selection & Hyperparameter Tuning

7.1 Cross-Validation

from sklearn.model_selection import cross_val_score

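# Perform 5-fold cross-validation of the classifier on the iris data
# (X and y were rebound to the regression data in section 5.2)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)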
print(f"Cross-validation scores: {scores}")
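
Per-fold scores are usually summarized as a mean and a spread, and cross_validate can report several metrics in one pass. The following is a sketch with assumed settings (50 trees, accuracy and macro F1), not a recommended configuration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X_cv, y_cv = load_iris(return_X_y=True)

cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_cv,
    y_cv,
    cv=5,
    scoring=["accuracy", "f1_macro"],
)

# Each requested metric appears under a "test_<name>" key with one score per fold
for metric in ("test_accuracy", "test_f1_macro"):
    fold_scores = cv_results[metric]
    print(f"{metric}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```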

7.2 Grid Search

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}

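# Grid search on the iris training split
# (X_train and y_train were rebound to the regression data in section 5.2)
X_train_clf, _, y_train_clf, _ = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid_search.fit(X_train_clf, y_train_clf)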

print(f"Best parameters: {grid_search.best_params_}")

8. Model Evaluation

8.1 Classification Metrics

from sklearn.metrics import classification_report, confusion_matrix

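# Recreate the iris test split and the classifier's predictions
# (y_test and y_pred were rebound to the regression task in section 5.2)
_, X_test_clf, _, y_test_clf = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
y_pred_clf = clf.predict(X_test_clf)

# Confusion matrix
print(confusion_matrix(y_test_clf, y_pred_clf))

# Detailed classification report
print(classification_report(y_test_clf, y_pred_clf, target_names=iris.target_names))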

8.2 Regression Metrics

from sklearn.metrics import mean_absolute_error, r2_score

# Evaluate the regression model from section 5.2
# (y_test and y_pred still hold its targets and predictions)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae:.2f}")
print(f"R² Score: {r2:.2f}")
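
To make the metrics concrete, here is a tiny worked example with hand-checkable numbers. The arrays are made up for illustration and are unrelated to the dataset above. It also shows that R² can be negative when predictions are worse than always predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_demo = np.array([3.0, 5.0, 2.0])
y_pred_demo = np.array([2.0, 5.0, 4.0])  # errors: 1, 0, -2

mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)  # (1 + 0 + 2) / 3 = 1.0
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)   # (1 + 0 + 4) / 3 = 5/3
rmse_demo = np.sqrt(mse_demo)                             # same units as the target
r2_demo = r2_score(y_true_demo, y_pred_demo)              # 1 - 5 / (14/3) = -1/14

print(f"MAE={mae_demo:.3f} MSE={mse_demo:.3f} RMSE={rmse_demo:.3f} R2={r2_demo:.3f}")
```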

9. Pipeline Automation

from sklearn.pipeline import Pipeline

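# Define pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train on the iris training split and predict on its test split
# (X_train and X_test were rebound to the regression data in section 5.2)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
pipeline.fit(X_train_clf, y_train_clf)
y_pred_pipeline = pipeline.predict(X_test_clf)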
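
A pipeline can itself be tuned with a grid search, which refits the scaler inside every cross-validation fold. The sketch below assumes a small, arbitrary parameter grid and illustrative variable names. Parameters of a step are addressed as the step name, two underscores, then the parameter name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_p, y_p = load_iris(return_X_y=True)
X_p_train, X_p_test, y_p_train, y_p_test = train_test_split(
    X_p, y_p, test_size=0.2, random_state=42, stratify=y_p
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# "classifier__max_depth" targets max_depth of the step named "classifier"
pipe_grid = {
    "classifier__n_estimators": [25, 50],
    "classifier__max_depth": [None, 3],
}
search = GridSearchCV(pipe, pipe_grid, cv=3)
search.fit(X_p_train, y_p_train)

held_out_accuracy = search.score(X_p_test, y_p_test)
print(search.best_params_)
print(f"Held-out accuracy: {held_out_accuracy:.2f}")
```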

10. Saving and Loading Models

import joblib

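# Save the trained regression model from section 5.2
joblib.dump(model, 'model.pkl')

# Load model (joblib files are pickle-based, so only load files you trust)
loaded_model = joblib.load('model.pkl')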
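
A quick way to check a round trip is to compare predictions before and after loading. This sketch assumes a LinearRegression fitted on the bundled diabetes data and writes to a temporary directory; the file name and variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X_d, y_d = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_d, y_d)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(reg, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(reg.predict(X_d[:5]), restored.predict(X_d[:5]))
print(same_predictions)
```

Saved models are generally expected to be loaded with the same scikit-learn version that produced them.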

Conclusion

Scikit-Learn covers the core machine learning workflow in Python: data preprocessing, model training, evaluation, and hyperparameter tuning, all behind a consistent estimator interface.