Data Science Cheat Sheet

The core ideas of Data Science distilled into a single, scannable reference — perfect for review or quick lookup.

PiqCue — piqcue.com/data-science/cheatsheet

Quick Reference

Data Wrangling

The process of cleaning, transforming, and restructuring raw data into a usable format for analysis. Data wrangling often consumes the majority of a data scientist's time, as real-world data is messy, incomplete, and inconsistent.
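As a rough illustration of the cleaning and restructuring steps, the sketch below tidies a few made-up records using only the Python standard library. The field names and the `raw_rows` data are hypothetical; a real project would more likely use a dataframe library.

```python
# Hypothetical raw records, as they might arrive from a CSV export.
raw_rows = [
    {"name": " Alice ", "age": "34", "city": "paris"},
    {"name": "BOB", "age": "", "city": "Berlin "},
    {"name": " Alice ", "age": "34", "city": "paris"},  # exact duplicate
    {"name": "carol", "age": "n/a", "city": "ROME"},
]

def clean_row(row):
    """Trim whitespace, normalize capitalization, and parse age to int or None."""
    age_text = row["age"].strip()
    return {
        "name": row["name"].strip().title(),
        "age": int(age_text) if age_text.isdigit() else None,
        "city": row["city"].strip().title(),
    }

def wrangle(rows):
    """Clean every row and drop duplicates while preserving the original order."""
    seen = set()
    cleaned = []
    for row in rows:
        tidy = clean_row(row)
        key = (tidy["name"], tidy["age"], tidy["city"])
        if key not in seen:
            seen.add(key)
            cleaned.append(tidy)
    return cleaned

clean_rows = wrangle(raw_rows)
```

Unparseable ages are kept as `None` rather than dropped, so a later step can decide how to handle the missing values.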

Exploratory Data Analysis (EDA)

An approach to analyzing datasets by summarizing their main characteristics using statistical summaries and visualizations before applying formal modeling. EDA helps identify patterns, detect anomalies, and test assumptions about the data's structure.
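A first EDA pass often amounts to a handful of summary statistics plus a simple anomaly check. The sketch below assumes a made-up list of response times and flags outliers with the common 1.5 × IQR rule; the helper name `summarize` and the data are illustrative only.

```python
import statistics

def summarize(values):
    """Return basic summary statistics and values outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical response times in milliseconds; one value looks suspicious.
response_times = [12, 14, 13, 15, 14, 13, 12, 16, 15, 95]
summary = summarize(response_times)
```

Comparing the mean with the median in the result is a quick way to see how strongly a single extreme value pulls the average.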

Statistical Inference

The process of drawing conclusions about a population based on a sample of data, using probability theory to quantify uncertainty. It includes hypothesis testing, confidence intervals, and estimation of parameters.
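To make the idea of quantifying uncertainty concrete, here is a minimal sketch of a confidence interval for a population mean. It assumes the normal approximation is acceptable; for a sample as small as the hypothetical one below, a t-distribution would usually be the better choice, but the standard library does not provide one.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return mean - margin, mean + margin

# Hypothetical sample of heights in centimetres.
heights_cm = [168.2, 171.5, 165.0, 174.3, 169.8, 172.1, 167.4, 170.6]
low, high = mean_confidence_interval(heights_cm)
```

Raising `confidence` widens the interval, which reflects the trade-off between certainty and precision.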

Regression

A supervised learning technique that models the relationship between a dependent variable and one or more independent variables to predict continuous outcomes. Linear regression is the simplest form; variants include polynomial, ridge, and lasso regression. Logistic regression, despite its name, predicts class probabilities and is normally used for classification.
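For the simplest case, a single independent variable, ordinary least squares has a closed-form solution. The sketch below fits a line y = slope * x + intercept; the function name `fit_line` and the study-hours data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied versus a continuous test score.
hours = [1, 2, 3, 4, 5]
scores = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(hours, scores)
predicted_score = slope * 6 + intercept  # prediction for 6 hours
```

The function assumes the x values are not all identical, since that would make `sum_xx` zero.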

Classification

A supervised learning task where the goal is to assign input data to predefined categories or labels. Common algorithms include logistic regression, decision trees, random forests, support vector machines, and neural networks.
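One of the simplest possible classifiers is a decision stump, a decision tree with a single split. The sketch below is an illustrative stand-in for the algorithms listed above, not a production method: it picks the threshold on one numeric feature that misclassifies the fewest training points. The data and function names are hypothetical.

```python
def fit_stump(xs, labels):
    """Find the threshold and orientation with the fewest training errors.

    Labels are 0 or 1. Requires at least two distinct feature values.
    """
    distinct = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for threshold in candidates:
        for positive_above in (True, False):
            errors = sum(
                predict((threshold, positive_above), x) != label
                for x, label in zip(xs, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, positive_above)
    return best[1], best[2]

def predict(stump, x):
    """Return 1 if x falls on the positive side of the threshold, else 0."""
    threshold, positive_above = stump
    return 1 if (x > threshold) == positive_above else 0

# Hypothetical one-feature training set with binary labels.
feature = [1.0, 1.5, 2.0, 3.5, 4.0, 4.5]
labels = [0, 0, 0, 1, 1, 1]
stump = fit_stump(feature, labels)
```

Full decision trees repeat this kind of split recursively, and random forests combine many such trees.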

Clustering

An unsupervised learning technique that groups similar data points together without predefined labels. Clustering algorithms discover natural groupings in the data based on similarity metrics such as Euclidean distance.
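A minimal sketch of k-means, one widely used clustering algorithm, shows how Euclidean distance drives the grouping. The farthest-first initialization used here is an assumption chosen to keep the example deterministic; libraries typically use randomized schemes. The points are made up.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster points with k-means; returns (centroids, labels)."""
    # Farthest-first initialization: start from the first point, then keep
    # adding the point farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def nearest(point):
        return min(range(k), key=lambda i: math.dist(point, centroids[i]))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p) for p in points]

# Hypothetical 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, cluster_labels = kmeans(points, k=2)
```

The number of clusters `k` must be chosen up front, which is one of the main practical limitations of k-means.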

Feature Engineering

The process of creating, selecting, and transforming input variables to improve a machine learning model's predictive performance. Good feature engineering often has a greater impact on results than choosing a more complex algorithm.
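Two common transformations are standardizing a numeric variable and one-hot encoding a categorical one. The sketch below shows both with the standard library; the helper names and sample values are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]

def one_hot(labels):
    """Encode categorical labels as 0/1 indicator vectors.

    Returns the sorted category list and one indicator row per label.
    """
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows

# Hypothetical raw features.
incomes = [32000, 45000, 51000, 78000]
scaled_incomes = standardize(incomes)
categories, encoded_colors = one_hot(["red", "blue", "red", "green"])
```

In a real workflow, the mean and spread should be computed on the training data only and then reused for validation and test data.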

Cross-Validation

A resampling technique used to evaluate how well a model generalizes to unseen data by partitioning the dataset into complementary training and validation subsets. $k$-fold cross-validation splits the data into $k$ folds, then trains on $k-1$ folds and validates on the remaining one, repeating $k$ times so that each fold serves as the validation set exactly once. The $k$ validation scores are typically averaged.
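The $k$-fold partitioning can be sketched as a small generator of index lists. The function name `k_fold_indices` is hypothetical, and the sketch does not shuffle; in practice the data is usually shuffled first unless its order matters, as with time series.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
```

Each index appears in exactly one validation fold, so every observation is used for validation once and for training $k-1$ times.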

A/B Testing

A controlled experiment comparing two versions of a variable (A and B) to determine which performs better on a defined metric. It relies on random assignment and statistical hypothesis testing to establish causal relationships between changes and outcomes.
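When the metric is a conversion rate, one common choice for the hypothesis test is a two-proportion z-test. The sketch below assumes large samples and uses invented visitor counts; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value). Assumes the pooled rate is strictly between 0 and 1.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Under the null hypothesis both variants share one pooled rate.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 1000 visitors per variant.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the observed difference is unlikely under the null hypothesis of equal rates. The significance threshold should be fixed before the experiment starts.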

Data Visualization

The graphical representation of data and information using charts, graphs, maps, and dashboards to communicate patterns, trends, and outliers effectively. Strong visualizations make complex analyses accessible to non-technical stakeholders.

Key Terms at a Glance

A/B Testing: A controlled experiment comparing two variants to determine which performs better on a specified metric, using random assignment and statistical hypothesis testing.

Bagging: Bootstrap Aggregating; an ensemble method that trains multiple models on bootstrap samples of the training data (random samples drawn with replacement) and combines their predictions, by averaging or voting, to reduce variance.

Bias: The error introduced by approximating a complex real-world problem with a simplified model. High bias leads to underfitting.

Classification: A supervised learning task that assigns input data points to predefined categorical labels based on learned patterns from training data.

Clustering: An unsupervised learning technique that groups data points into clusters based on similarity, without using predefined labels.

Confidence Interval: A range of values, computed from sample statistics by a procedure that captures the true population parameter in a specified proportion of repeated samples (the confidence level, e.g., 95%).

Cross-Validation: A resampling method that partitions data into training and validation subsets multiple times to estimate model performance on unseen data.

Data Pipeline: An automated series of processes that extract, transform, and load data from source systems to analytical destinations.

Data Wrangling: The process of cleaning, restructuring, and enriching raw data into a format suitable for analysis and modeling.

Dimensionality Reduction: Techniques, such as PCA and t-SNE, that reduce the number of input variables in a dataset while preserving as much information as possible.

Ensemble Methods: Techniques that combine multiple models to produce a prediction that is more accurate and robust than any individual model, including bagging, boosting, and stacking.

Feature Engineering: The process of creating, transforming, and selecting input variables to improve the predictive performance of machine learning models.

Gradient Descent: An iterative optimization algorithm that adjusts model parameters by moving in the direction of the steepest decrease of the loss function.

Hypothesis Testing: A statistical method for making inferences about a population by testing an assumption (null hypothesis) against observed sample data.

Imputation: The process of replacing missing data with substituted values, using methods such as mean, median, mode, or model-based prediction.
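Of the terms above, gradient descent is the most algorithmic, so a short sketch may help. It minimizes the toy loss (w - 3)^2, whose gradient is 2(w - 3); the loss, learning rate, and step count are assumptions chosen purely for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        # Move in the direction of steepest decrease of the loss.
        w -= learning_rate * gradient(w)
    return w

def loss_gradient(w):
    """Gradient of the toy loss (w - 3) ** 2, which is minimized at w = 3."""
    return 2 * (w - 3)

best_w = gradient_descent(loss_gradient, start=0.0)
```

A learning rate that is too large can make the updates overshoot and diverge, while one that is too small makes convergence slow.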
