Data Science Glossary

A

  • Algorithm – A set of rules or instructions given to a computer to perform a task.
  • Artificial Intelligence (AI) – A branch of computer science that enables machines to simulate human intelligence.
  • Association Rule Mining – A technique used to discover interesting relationships between variables in large databases (e.g., Market Basket Analysis).
  • Anomaly Detection – The process of identifying rare events or observations that differ significantly from the majority of the data.
  • A/B Testing – A randomized experiment that compares two versions of something (e.g., a webpage) and uses a statistical test to determine which performs better.
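
As an illustration of the A/B Testing entry, here is a minimal sketch of a two-proportion z-test, one common way to compare conversion rates. The function name and the counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, expressed with erf.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical counts: 200 of 2000 visitors convert on A, 250 of 2000 on B.
z, p = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between versions is unlikely to be due to chance alone.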

B

  • Big Data – Large volumes of data that cannot be processed effectively using traditional methods.
  • Bias-Variance Tradeoff – The balance between underfitting (high bias) and overfitting (high variance) in model training.
  • Bayesian Statistics – An approach to statistics that combines prior knowledge with observed data, via Bayes’ theorem, to estimate probabilities.
  • Bagging (Bootstrap Aggregating) – An ensemble method that trains multiple models on bootstrap samples (random samples drawn with replacement) of the training data and combines their predictions to improve stability and accuracy.
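
To make the Bayesian Statistics entry concrete, the sketch below applies Bayes' theorem to update a prior probability after a positive test result. The function name and the numbers are hypothetical, chosen only for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of seeing a positive test.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"{p:.3f}")  # prints 0.161
```

Even with a fairly accurate test, the low prior keeps the posterior modest, which is the kind of effect prior knowledge has on the estimate.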

C

  • Clustering – Grouping similar data points together without predefined labels (e.g., K-Means, DBSCAN).
  • Classification – Assigning predefined labels to data points (e.g., Spam vs. Not Spam).
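  • Cross-Validation – A technique for evaluating machine learning models by repeatedly splitting the data into training and validation sets (e.g., k-fold) and averaging the results.
  • Confusion Matrix – A table that compares predicted labels with actual labels, showing the counts of true positives, false positives, true negatives, and false negatives.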
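
The Confusion Matrix entry can be sketched for the binary case as a simple tally of the four outcomes. The function name and the label lists are made up for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, and TN for a binary classification task."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_counts(actual, predicted)
print(cm["TP"], cm["FP"], cm["FN"], cm["TN"])  # prints 3 1 1 3
```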

D

  • Data Cleaning – The process of fixing or removing incorrect, corrupted, or inconsistent data.
  • Data Engineering – The practice of designing and building systems to collect, store, and analyze data.
  • Dimensionality Reduction – Techniques to reduce the number of input variables (e.g., PCA, t-SNE).
  • Decision Tree – A model that makes decisions based on feature-based conditions in a tree-like structure.

E

  • Exploratory Data Analysis (EDA) – The process of analyzing and visualizing data to understand its characteristics before modeling.
  • Ensemble Learning – Combining multiple models to improve prediction performance (e.g., Random Forest, Gradient Boosting).
  • ETL (Extract, Transform, Load) – The process of gathering, transforming, and loading data into a system for analysis.

F

  • Feature Engineering – The process of creating new variables (features) from raw data to improve model performance.
  • Feature Selection – Identifying and selecting the most relevant features for a model.
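  • F1 Score – The harmonic mean of precision and recall, used to balance the two in classification models.
  • False Positive / False Negative – Incorrect classifications in binary classification models: a false positive predicts the positive class for a negative case, and a false negative predicts the negative class for a positive case.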
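
As a sketch of how the F1 Score relates to false positives and false negatives, the function below computes precision, recall, and F1 (their harmonic mean) from raw counts. The function name and the counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```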

G

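  • Gradient Descent – An optimization algorithm that minimizes a model’s loss by repeatedly adjusting parameters in the direction opposite to the gradient.
  • Generative Adversarial Networks (GANs) – A neural network architecture in which a generator and a discriminator are trained against each other to generate new data similar to the training data.
  • Gaussian Distribution – Also called a normal distribution, a bell-shaped probability distribution that is symmetric around its mean and defined by its mean and standard deviation.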
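
The Gradient Descent entry can be illustrated on a one-dimensional toy problem. This is a minimal sketch: the function name, learning rate, and step count are arbitrary choices, and real models apply the same update to many parameters at once.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # prints 3.0
```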

H

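  • Hyperparameter Tuning – The process of searching for the best hyperparameters, the settings chosen before training (e.g., learning rate, tree depth) that control the learning process of a machine learning model.
  • Hypothesis Testing – A statistical method for deciding whether observed data provide enough evidence to reject an assumption (the null hypothesis).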

I

  • Imbalanced Data – A dataset where one class significantly outnumbers the others, which can lead to predictions biased toward the majority class.
  • Imputation – Replacing missing data with estimated values.
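
One simple form of Imputation is mean imputation, sketched below. Representing missing values as None is an assumption made for this example, and other strategies (median, model-based) may suit real data better.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0]))  # prints [1.0, 2.0, 3.0]
```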

J

  • Jaccard Similarity – A metric that measures the similarity between two sets as the size of their intersection divided by the size of their union.
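
Jaccard Similarity is short enough to write out directly. In this sketch, returning 1.0 for two empty sets is a convention assumed here, not a universal rule.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

print(jaccard({"apple", "bread", "milk"}, {"bread", "milk", "eggs"}))  # prints 0.5
```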

K

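  • K-Means Clustering – A popular unsupervised learning algorithm that partitions data into k clusters, assigning each point to the cluster with the nearest mean (centroid).
  • K-Nearest Neighbors (KNN) – An algorithm that predicts the label of a data point from the labels of its k nearest data points; it is used for both classification and regression.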
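
A minimal sketch of K-Nearest Neighbors classification follows, using Euclidean distance and a majority vote. The function name and the toy points are hypothetical, and the sketch assumes Python 3.8 or later for math.dist.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Sort training points by Euclidean distance to the query and keep the closest k.
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (2, 2)))  # prints a
```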

L

  • Linear Regression – A statistical method for modeling the relationship between dependent and independent variables.
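  • Logistic Regression – A classification algorithm that models the probability of a binary outcome using the logistic (sigmoid) function.
  • Loss Function – A function that measures the difference between a model’s predictions and the actual values; training aims to minimize it.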
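
For the Linear Regression entry, the sketch below fits a line to one independent variable with ordinary least squares, the fit that minimizes a squared-error loss. The function name and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # prints 2.0 1.0
```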

M

  • Machine Learning (ML) – The study of algorithms that allow computers to learn from data.
  • Mean Absolute Error (MAE) – A metric that measures the average absolute difference between actual and predicted values.
  • Mean Squared Error (MSE) – A metric that averages the squared differences between actual and predicted values, penalizing large errors more heavily than MAE.
  • Model Overfitting – When a model learns noise instead of the actual pattern in data, performing well on training but poorly on unseen data.
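
The two error metrics in this section, MAE and MSE, can be compared side by side. The values below are hypothetical; note how the single large error dominates MSE.

```python
def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 3.0]
print(mae(actual, predicted))  # prints 1.75
print(mse(actual, predicted))  # prints 5.25
```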

N

  • Natural Language Processing (NLP) – A field of AI focused on the interaction between computers and human language.
  • Neural Network – A model made of layers of interconnected nodes (neurons), loosely inspired by the human brain, that learns to recognize patterns in data.
  • Normalization – The process of scaling features to have a standard range (e.g., between 0 and 1).
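
Min-max scaling is one common form of Normalization; a minimal sketch follows. Mapping a constant feature to all zeros is a choice made for this example.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumed handling of a constant feature
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30, 50]))  # prints [0.0, 0.25, 0.5, 1.0]
```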

O

  • Outlier Detection – Identifying data points that significantly differ from the majority.
  • Overfitting – When a model learns noise in training data and fails to generalize to new data (see also Model Overfitting under M).

P

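  • Principal Component Analysis (PCA) – A technique for reducing the dimensionality of data by projecting it onto the directions (principal components) that capture the most variance.
  • Precision – The fraction of positive predictions that are actually positive: true positives divided by the sum of true positives and false positives.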
  • Predictive Modeling – Using historical data to predict future outcomes.

Q

  • Quantile Regression – A type of regression that estimates specific quantiles of the outcome (e.g., the median or the 90th percentile) instead of the mean.
  • Query – A request to retrieve data from a database.

R

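  • Random Forest – An ensemble learning method that trains many decision trees on random subsets of the data and features, then combines their predictions by voting or averaging.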
  • Reinforcement Learning – A type of machine learning where an agent learns by interacting with an environment.
  • ROC Curve (Receiver Operating Characteristic Curve) – A plot of a classification model’s true positive rate against its false positive rate across decision thresholds.

S

  • Standard Deviation – A measure of how spread out the values in a dataset are around the mean; it is the square root of the variance.
  • Supervised Learning – A type of machine learning where the model is trained on labeled data.
  • Support Vector Machine (SVM) – A classification algorithm that finds the boundary separating classes with the largest possible margin.

T

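  • Time Series Analysis – The study of data points recorded in time order, often at regular intervals, to identify trends, seasonality, and other patterns.
  • Tokenization – The process of splitting text into smaller units (tokens) such as words, subwords, or characters for NLP applications.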
  • True Positive / True Negative – Correctly classified outcomes in a classification model.
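
A very simple form of Tokenization is word-level splitting with a regular expression, sketched below. The pattern is an assumption for illustration; production NLP systems typically use more elaborate (often subword) tokenizers.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Data science isn't magic!"))
# prints ['data', 'science', "isn't", 'magic']
```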

U

  • Unsupervised Learning – A type of machine learning where the model learns patterns without labeled outcomes.
  • Underfitting – When a model is too simple and fails to learn patterns in data.

V

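  • Validation Set – A subset of the dataset, held out from training, used to tune hyperparameters and evaluate model performance during training.
  • Variance – In statistics, the average squared deviation of values from their mean; in machine learning, a measure of how much a model’s predictions fluctuate when it is trained on different data samples.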

W

  • Word Embeddings – Representing words as numerical vectors in NLP tasks.
  • Weighted Average – An average where some values contribute more to the final calculation than others.
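
The Weighted Average entry reduces to one line of arithmetic. In this sketch the scores and weights are hypothetical (for example, grades weighted by credit hours).

```python
def weighted_average(values, weights):
    """Sum of value * weight, divided by the total weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

print(weighted_average([80, 90, 70], [5, 3, 2]))  # prints 81.0
```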

X

  • XGBoost – An open-source library that provides an optimized implementation of gradient-boosted decision trees.

Y

  • Y-Intercept – In linear regression, the point where the regression line crosses the y-axis, i.e., the predicted value when all independent variables are zero.

Z

  • Z-Score – A statistical measure of how many standard deviations a value lies above or below the mean.
  • Zero-Shot Learning – A machine learning technique where the model predicts outputs for classes it has not seen during training.
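
The Z-Score entry can be sketched with the standard library. This example uses the population standard deviation; whether to use the population or sample version is a choice that depends on the data, and the values are made up.

```python
import statistics

def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mean) / sd for v in values]

print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
# prints [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```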