Scikit-Learn: A Comprehensive Guide with Examples
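1. Installation and Setup
Make sure Python 3 is installed. Current Scikit-Learn releases no longer support Python 3.7, so check the minimum Python version required by the release you plan to use. Install Scikit-Learn with pip: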
pip install scikit-learn
Or using Anaconda:
conda install -c conda-forge scikit-learn
Verify installation:
import sklearn
print(sklearn.__version__)
2. Core Features of Scikit-Learn
- Supervised Learning: Classification and regression
- Unsupervised Learning: Clustering, dimensionality reduction
- Model Selection & Hyperparameter Tuning
- Preprocessing & Feature Engineering
- Evaluation Metrics
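These feature areas share one estimator interface: fit learns from data, and predict or transform applies what was learned. The sketch below illustrates that pattern on the iris data. The choice of LogisticRegression and all variable names are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit learns per-feature statistics, transform applies them
X_scaled = StandardScaler().fit(X).transform(X)

# Supervised estimator: fit takes features and labels, predict returns labels
classifier = LogisticRegression(max_iter=1000).fit(X_scaled, y)
predicted_labels = classifier.predict(X_scaled)

# Unsupervised estimator: fit takes features only
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
cluster_ids = clusterer.predict(X_scaled)

print(predicted_labels[:5], cluster_ids[:5])
```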
3. Loading Datasets
from sklearn import datasets
# Load the iris dataset
iris = datasets.load_iris()
# Print dataset keys
print(iris.keys())
# Extract features and target labels
X, y = iris.data, iris.target
print(X.shape, y.shape)
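To see what the loaded object contains beyond the raw arrays, a short sketch like the following can help. It assumes the bundled iris dataset; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Human-readable names for the feature columns and the class labels
print(iris.feature_names)
print(iris.target_names)

# Number of samples per class (the targets are the integers 0, 1, 2)
class_counts = np.bincount(iris.target)
print(class_counts)

# return_X_y=True skips the container object and returns the two arrays directly
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```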
4. Data Preprocessing
4.1 Handling Missing Values
from sklearn.impute import SimpleImputer
import numpy as np
# Sample data with missing values
data = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
# Use mean strategy to fill missing values
imputer = SimpleImputer(strategy="mean")
data_imputed = imputer.fit_transform(data)
print(data_imputed)
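The imputer stores the per-column means it learned, so the same fill values can be reused on data it has not seen. A minimal sketch, reusing the sample array from above; new_rows is a made-up input for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# fit computes one mean per column, ignoring missing entries
imputer = SimpleImputer(strategy="mean").fit(train)
print(imputer.statistics_)  # column means: 4.0, 5.0, 7.5

# transform fills missing values in new data with the stored means
new_rows = np.array([[np.nan, np.nan, np.nan]])
filled = imputer.transform(new_rows)
print(filled)
```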
4.2 Feature Scaling (Standardization & Normalization)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Min-Max Normalization
min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)
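The snippet above fits the scalers on the full feature matrix. When a held-out test set is involved, a common practice is to learn the scaling statistics from the training rows only, so that no information from the test rows leaks into preprocessing. A sketch of that variant, assuming the iris data and an 80/20 split:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics, no refit

# Training columns end up with mean 0 and standard deviation 1
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))
```

The scaled test columns will generally not have exactly zero mean or unit variance, which is expected.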
5. Supervised Learning
5.1 Classification Example (Iris Dataset)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a RandomForest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Predictions
y_pred = clf.predict(X_test)
# Evaluate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
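Beyond hard class predictions, a fitted random forest also exposes class probabilities and feature importances. The sketch below repeats the same split and model settings so it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1
probabilities = clf.predict_proba(X_test)
print(probabilities[:3])

# Impurity-based importances, one per feature, summing to 1
importances = clf.feature_importances_
for name, importance in zip(iris.feature_names, importances):
    print(f"{name}: {importance:.3f}")
```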
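5.2 Regression Example (Diabetes Dataset)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load dataset. load_boston was removed in scikit-learn 1.2, so this example uses the bundled diabetes dataset instead.
diabetes = datasets.load_diabetes()
# Note: this reassigns X and y, and the split below reassigns X_train, X_test, y_train and y_test; the iris data remains available as iris.data and iris.target
X, y = diabetes.data, diabetes.target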
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
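An error value is easier to interpret next to a trivial baseline. The following sketch compares linear regression with a model that always predicts the training mean. It uses the bundled diabetes dataset, an assumption made here so the block runs on current scikit-learn releases, and reports the root mean squared error, which is in the same units as the target.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Taking the square root of the MSE gives the RMSE
rmse_model = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))

print(f"Linear regression RMSE: {rmse_model:.2f}")
print(f"Mean baseline RMSE: {rmse_baseline:.2f}")
```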
6. Unsupervised Learning
6.1 Clustering (K-Means)
from sklearn.cluster import KMeans
# Fit K-Means with 3 clusters on the iris features (X was reassigned in section 5.2)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(iris.data)
# Print cluster centers
print(kmeans.cluster_centers_)
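The number of clusters has to be chosen up front. One common heuristic is to compare the inertia (within-cluster sum of squared distances) for several values of k and look for the point where improvements level off. A sketch on the iris features; the range of k is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = kmeans.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.2f}")

# labels_ holds the cluster index assigned to each training sample
print(kmeans.labels_[:10])
```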
6.2 Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
# Reduce the four iris features to 2 dimensions (iris.data is used because X no longer holds the iris features)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris.data)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
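Instead of fixing the number of components, PCA also accepts a fraction between 0 and 1 and keeps as many components as are needed to explain that share of the variance. A sketch on the iris features; standardizing first and the 0.95 threshold are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Fit all components to inspect the cumulative explained variance
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative.round(3))

# Keep the smallest number of components that explains at least 95% of the variance
pca_95 = PCA(n_components=0.95).fit(X_std)
print(f"Components kept: {pca_95.n_components_}")
```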
7. Model Selection & Hyperparameter Tuning
7.1 Cross-Validation
from sklearn.model_selection import cross_val_score
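# Perform 5-fold cross-validation on the iris data (X and y now hold the regression data, which the classifier cannot use)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)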
print(f"Cross-validation scores: {scores}")
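When more control is needed, cross_validate accepts an explicit splitter and several metrics at once. The sketch below uses shuffled, stratified folds on the iris data; the choice of metrics is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Shuffled, stratified folds keep the class balance in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

# Each requested metric appears under a "test_" prefixed key
for metric in ("test_accuracy", "test_f1_macro"):
    scores = results[metric]
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```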
7.2 Grid Search
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
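# X_train and y_train now hold the regression split, so recreate the iris split first
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Grid search (random_state fixed so results are reproducible)
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid_search.fit(X_train_clf, y_train_clf)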
print(f"Best parameters: {grid_search.best_params_}")
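After the search, the fitted object also exposes the best cross-validated score and a model refit with the best parameters, which can then be scored on held-out data. A self-contained sketch; the grid is deliberately smaller than the one above to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best parameter combination
print(f"Best CV score: {grid_search.best_score_:.3f}")

# best_estimator_ is refit on all of X_train (refit=True is the default)
test_accuracy = grid_search.best_estimator_.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```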
8. Model Evaluation
8.1 Classification Metrics
from sklearn.metrics import classification_report, confusion_matrix
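# y_test and y_pred now belong to the regression model, so recreate the iris split and predict with clf
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
y_pred_clf = clf.predict(X_test_clf)
# Confusion matrix
print(confusion_matrix(y_test_clf, y_pred_clf))
# Detailed classification report
print(classification_report(y_test_clf, y_pred_clf))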
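The per-class numbers in the classification report can be derived directly from the confusion matrix. The sketch below recomputes recall, precision, and accuracy by hand and compares them with the library functions. It assumes every class is predicted at least once, otherwise the precision division would hit a zero column total.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
true_positives = np.diag(cm)
recall_manual = true_positives / cm.sum(axis=1)     # divide by row totals
precision_manual = true_positives / cm.sum(axis=0)  # divide by column totals
accuracy_manual = true_positives.sum() / cm.sum()

print(recall_manual, recall_score(y_test, y_pred, average=None))
print(precision_manual, precision_score(y_test, y_pred, average=None))
print(f"Accuracy: {accuracy_manual:.2f}")
```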
8.2 Regression Metrics
from sklearn.metrics import mean_absolute_error, r2_score
# Evaluate model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R² Score: {r2:.2f}")
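To make these metrics concrete, the sketch below computes MAE, MSE, and R² by hand with NumPy and checks them against the library functions. The four value pairs are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tiny made-up example values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat
mae_manual = np.mean(np.abs(errors))             # average absolute error
mse_manual = np.mean(errors ** 2)                # average squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot                  # share of variance explained

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An R² of 1 means perfect predictions, 0 matches always predicting the mean of y_true, and negative values are worse than that baseline.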
9. Pipeline Automation
from sklearn.pipeline import Pipeline
# Define pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100))
])
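# Train and predict on the iris split (recreated here because X_train and y_train hold regression data after section 5.2)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
pipeline.fit(X_train_clf, y_train_clf)
y_pred_clf = pipeline.predict(X_test_clf)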
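Because a pipeline behaves like a single estimator, it can be passed to cross-validation, where the scaler is refit inside each fold. Parameters of individual steps are addressed as step name, two underscores, parameter name. A sketch on the iris data, reusing the step names from above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# The whole pipeline, scaler included, is refit on each training fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")

# Change a step's parameter through the step__parameter syntax, then refit
pipeline.set_params(classifier__n_estimators=50)
pipeline.fit(X, y)
n_trees = len(pipeline.named_steps["classifier"].estimators_)
print(f"Trees in the fitted forest: {n_trees}")
```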
10. Saving and Loading Models
import joblib
# Save model
joblib.dump(model, 'model.pkl')
# Load model
loaded_model = joblib.load('model.pkl')
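A quick way to confirm that a saved model was restored correctly is to compare its predictions with those of the original. The sketch below writes to a temporary directory; the file name and the use of the diabetes dataset are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(f"Predictions match: {same_predictions}")
```

joblib files are pickle-based, so loading one can execute arbitrary code; only load files from sources you trust, and ideally with the same scikit-learn version that saved them.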
Conclusion
Scikit-Learn is a core library for machine learning in Python. It covers data preprocessing, model training, evaluation, and hyperparameter tuning behind a consistent estimator interface.