Data Science Cheat Sheet
The core ideas of Data Science distilled into a single, scannable reference — perfect for review or quick lookup.
Quick Reference
Data Wrangling
The process of cleaning, transforming, and restructuring raw data into a usable format for analysis. Data wrangling is often reported to take up a large share of a data scientist's time, because real-world data tends to be messy, incomplete, and inconsistent.
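A minimal sketch of the clean, transform, and restructure steps, using only the Python standard library. The records, the field names `city` and `temp_f`, and the helper `clean_row` are hypothetical and exist only for illustration.

```python
# Hypothetical raw records: inconsistent casing, stray whitespace, missing values.
raw_rows = [
    {"city": " Boston ", "temp_f": "68"},
    {"city": "boston", "temp_f": ""},
    {"city": "CHICAGO", "temp_f": "59"},
    {"city": "Chicago", "temp_f": "n/a"},
]


def clean_row(row):
    """Normalize the city name and parse the temperature (None if unparseable)."""
    city = row["city"].strip().title()
    try:
        temp_f = float(row["temp_f"])
    except ValueError:
        temp_f = None
    return {"city": city, "temp_f": temp_f}


# Clean and transform every record.
clean_rows = [clean_row(r) for r in raw_rows]

# Restructure: group the valid readings by city.
temps_by_city = {}
for row in clean_rows:
    if row["temp_f"] is not None:
        temps_by_city.setdefault(row["city"], []).append(row["temp_f"])

print(temps_by_city)
```

Whether to drop, flag, or impute the missing readings depends on the analysis; this sketch simply leaves them out of the grouped result.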
Exploratory Data Analysis (EDA)
An approach to analyzing datasets by summarizing their main characteristics using statistical summaries and visualizations before applying formal modeling. EDA helps identify patterns, detect anomalies, and test assumptions about the data's structure.
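As a rough illustration of the "statistical summaries" and "detect anomalies" parts, the sketch below computes a few summary statistics and flags outliers with the 1.5 × IQR rule of thumb. The sample values are made up, and the IQR rule is one common convention among several.

```python
import statistics

# Hypothetical sample containing one suspicious value.
values = [12, 15, 14, 13, 16, 15, 14, 95]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# Quartiles (Python 3.8+), then flag points more than 1.5 * IQR outside them.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(summary)
print(outliers)
```

The gap between the mean and the median here is itself a hint that the distribution is skewed by an extreme value.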
Statistical Inference
The process of drawing conclusions about a population based on a sample of data, using probability theory to quantify uncertainty. It includes hypothesis testing, confidence intervals, and estimation of parameters.
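One way to make "confidence intervals" concrete is the sketch below, which estimates a population mean from a sample. It assumes a normal approximation (a z critical value); for a sample this small a t critical value would usually be more appropriate. The function name `mean_confidence_interval` and the data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    center = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return center - margin, center + margin


# Hypothetical measurements.
sample = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.0, 4.7, 5.1, 4.9]
low, high = mean_confidence_interval(sample)
print(low, high)
```

Raising `confidence` widens the interval: more certainty about covering the true mean costs precision.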
Regression
A supervised learning technique that models the relationship between a dependent variable and one or more independent variables to predict continuous outcomes. Linear regression is the simplest form; variants include polynomial, ridge, and lasso regression. Logistic regression, despite its name, predicts class probabilities and is normally used for classification.
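For the simplest case, a single independent variable, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x. The sketch below implements that formula; the function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations of roughly y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predicted outcome for a new input x = 6
print(slope, intercept, prediction)
```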
Classification
A supervised learning task where the goal is to assign input data to predefined categories or labels. Common algorithms include logistic regression, decision trees, random forests, support vector machines, and neural networks.
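To show the idea at its smallest, the sketch below fits a decision stump, which is a decision tree with a single split on one feature. It assumes binary labels where class 1 lies above the threshold; the names `fit_stump` and `predict` and the study-hours data are hypothetical.

```python
def fit_stump(xs, labels):
    """Pick the threshold on one feature that minimizes training errors.

    Assumes the stump predicts 1 when x > threshold and 0 otherwise.
    """
    points = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in candidates:
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(x, threshold):
    return 1 if x > threshold else 0


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_stump(hours, passed)
print(threshold, predict(2.0, threshold), predict(5.0, threshold))
```

Full decision trees apply this kind of split recursively, and random forests average many such trees.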
Clustering
An unsupervised learning technique that groups similar data points together without predefined labels. The algorithm discovers natural groupings in the data based on similarity metrics such as Euclidean distance.
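A minimal k-means sketch using Euclidean distance illustrates the idea. It assumes a naive initialization (the first `k` points become the starting centroids) and a fixed number of iterations; practical implementations use better initialization and a convergence check. The points are made up.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    centroids = list(points[:k])  # naive initialization, an assumption of this sketch
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```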
Feature Engineering
The process of creating, selecting, and transforming input variables to improve a machine learning model's predictive performance. Good feature engineering often has a greater impact on results than choosing a more complex algorithm.
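Three common transformations are sketched below: standardizing a numeric column, deriving a ratio feature, and one-hot encoding a category. The housing records and the field names (`sqft`, `rooms`, `type`) are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical raw records.
records = [
    {"sqft": 800, "rooms": 2, "type": "condo"},
    {"sqft": 1600, "rooms": 5, "type": "house"},
    {"sqft": 1200, "rooms": 4, "type": "house"},
]

sqft_values = [r["sqft"] for r in records]
sqft_mean = mean(sqft_values)
sqft_std = pstdev(sqft_values)
categories = sorted({r["type"] for r in records})


def build_features(record):
    features = {
        # Standardize: zero mean, unit variance across the dataset.
        "sqft_z": (record["sqft"] - sqft_mean) / sqft_std,
        # Derived ratio feature.
        "sqft_per_room": record["sqft"] / record["rooms"],
    }
    # One-hot encode the categorical column.
    for category in categories:
        features[f"type_{category}"] = 1 if record["type"] == category else 0
    return features


feature_rows = [build_features(r) for r in records]
print(feature_rows[0])
```

In a real pipeline the mean, standard deviation, and category list would be computed on the training split only and reused for validation and test data, to avoid leaking information.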
Cross-Validation
A resampling technique used to evaluate how well a model generalizes to unseen data by partitioning the dataset into complementary training and validation subsets. $k$-fold cross-validation splits the data into $k$ folds, then trains on $k-1$ folds and validates on the remaining one, repeating until each fold has served as the validation set exactly once. The $k$ validation scores are typically averaged into a single estimate.
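The index bookkeeping behind $k$-fold splitting can be sketched as follows. The generator name `k_fold_splits` is hypothetical, and the sketch does not shuffle; in practice the data is usually shuffled first unless order matters, as in time series.

```python
def k_fold_splits(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


splits = list(k_fold_splits(n_samples=10, k=3))
for training, validation in splits:
    print("train:", training, "validate:", validation)
```

Each index appears in exactly one validation fold, so every observation is used for validation once and for training $k-1$ times.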
A/B Testing
A controlled experiment comparing two versions of a variable (A and B) to determine which performs better on a defined metric. It relies on random assignment and statistical hypothesis testing to establish causal relationships between changes and outcomes.
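When the metric is a conversion rate, one common choice of hypothesis test is the two-proportion z-test, sketched below. The function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical; the test assumes random assignment and samples large enough for the normal approximation.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 2000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(z, p_value)
```

A small p-value suggests the observed difference is unlikely under the null hypothesis; the significance threshold and sample size should be fixed before the experiment starts.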
Data Visualization
The graphical representation of data and information using charts, graphs, maps, and dashboards to communicate patterns, trends, and outliers effectively. Strong visualizations make complex analyses accessible to non-technical stakeholders.