Statistics — Distribution continuous, Outliers interquartile (extended) Cheat Sheet

The core ideas of Statistics — Distribution continuous, Outliers interquartile (extended) distilled into a single, scannable reference — perfect for review or quick lookup.

PiqCue — piqcue.com/statistics-part-2/cheatsheet

Quick Reference

Mean, Median, and Mode

The three primary measures of central tendency. The mean is the arithmetic average, the median is the middle value when data are ordered, and the mode is the most frequently occurring value. Each measure captures a different aspect of a dataset's center.
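As a quick illustration of how the three measures can disagree, the sketch below uses Python's standard `statistics` module on a small made-up sample (the numbers are hypothetical, chosen so that one large value pulls the mean upward).

```python
import statistics

# Hypothetical sample: the single large value (40) drags the mean upward.
data = [2, 3, 3, 5, 7, 40]

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # average of the two middle values (3 and 5)
mode = statistics.mode(data)      # most frequently occurring value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

In this sample the mean sits well above the median and the mode, which is typical of right-skewed data.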

Standard Deviation

A measure of the spread or dispersion of a dataset relative to its mean. It is calculated as the square root of the variance, which is the average of squared deviations from the mean (dividing by $n - 1$ rather than $n$ when the variance is estimated from a sample). A low standard deviation indicates data points cluster near the mean, while a high value indicates greater spread.
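The calculation can be followed step by step. This is a minimal sketch on a made-up dataset; it computes the population form by hand (dividing by the number of values) and, for comparison, the sample form from `statistics.stdev`, which divides by one less than the number of values.

```python
import math
import statistics

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
squared_devs = [(x - mean) ** 2 for x in data]  # squared deviations from the mean
pop_variance = sum(squared_devs) / len(data)    # average squared deviation
pop_sd = math.sqrt(pop_variance)                # population standard deviation

# Sample standard deviation divides by n - 1, so it is slightly larger.
sample_sd = statistics.stdev(data)

print("population variance:", pop_variance)
print("population sd:", pop_sd)
print("sample sd:", round(sample_sd, 3))
```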

Normal Distribution

A symmetric, bell-shaped probability distribution defined by its mean $\mu$ and standard deviation $\sigma$. It is fundamental to statistics because of the Central Limit Theorem, which states that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the population's shape, provided the population has finite variance. For normally distributed data, approximately 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three.
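The 68, 95, 99.7 figures can be checked directly from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the standard library; the helper name `coverage` is only an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {coverage(k):.4f}")
```

Because the result depends only on the number of standard deviations, the same proportions hold for any $\mu$ and $\sigma$.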

Hypothesis Testing

A formal procedure for using sample data to evaluate claims about a population. The process involves stating a null hypothesis (no effect or no difference) and an alternative hypothesis, calculating a test statistic, and determining whether the evidence is strong enough to reject the null hypothesis at a chosen significance level.

P-Value

The probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true. A small p-value suggests that the observed data are unlikely under the null hypothesis, providing evidence against it. It does not measure the probability that the null hypothesis is true.
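One simple case where the whole procedure fits in a few lines is a two-sided z-test for a mean when the population standard deviation is treated as known. The sketch below is an illustration under that assumption; the function name and the numbers (a sample mean of 103 against a null value of 100) are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and two-sided p-value for H0: population mean == mu0.

    Assumes the population standard deviation sigma is known.
    """
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Probability of a statistic at least this extreme, in either tail, under H0.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical numbers for illustration only.
z, p_value = two_sided_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)

alpha = 0.05                  # chosen significance level
reject_null = p_value < alpha

print("z =", z)
print("p-value =", round(p_value, 4))
print("reject H0 at 5%:", reject_null)
```

The p-value here is the chance of data this extreme if the null hypothesis holds, not the chance that the null hypothesis is true.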

Confidence Intervals

A range of values, derived from sample data, that is likely to contain the true population parameter with a specified level of confidence. A 95% confidence interval means that if the same sampling procedure were repeated many times, approximately 95% of the constructed intervals would contain the true parameter.
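The repeated-sampling interpretation can be seen in a small simulation. The sketch below builds a normal-approximation interval for the mean and counts how often it captures a known true mean; the population parameters, sample size, and trial count are arbitrary choices for illustration, and for small samples a t-based interval would usually be preferred.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation confidence interval for the mean of data."""
    n = len(data)
    mean = statistics.mean(data)
    standard_error = statistics.stdev(data) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * standard_error, mean + z * standard_error

random.seed(0)  # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 50.0, 10.0  # hypothetical population
trials = 2000
hits = 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(30)]
    low, high = mean_confidence_interval(sample)
    if low <= TRUE_MEAN <= high:
        hits += 1

coverage = hits / trials
print("share of intervals containing the true mean:", coverage)
```

The printed share should land near 0.95, slightly below it because the normal approximation is a little narrow at this sample size.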

Regression Analysis

A set of statistical methods for estimating the relationship between a dependent variable and one or more independent variables. Simple linear regression fits a straight line to the data, while multiple regression and nonlinear regression handle more complex relationships. It is widely used for prediction and for exploring which factors are associated with an outcome, although a fitted regression on its own does not establish causation.
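For the single-predictor case, the least-squares line has a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations in $x$, and the intercept makes the line pass through the point of means. The sketch below implements that by hand; `fit_line` is an illustrative name and the data are made up to lie exactly on $y = 2x + 1$.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predicted y at x = 6

print("slope:", slope)
print("intercept:", intercept)
print("prediction at x = 6:", prediction)
```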

Correlation

A statistical measure that quantifies the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient ranges from $-1$ (perfect negative correlation) to $+1$ (perfect positive correlation), with 0 indicating no linear relationship. Correlation does not imply causation.
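The point that $r$ measures only linear association is easy to demonstrate. The sketch below computes the Pearson coefficient by hand on three made-up datasets; in the last one $y$ is fully determined by $x$ (it is $x^2$), yet $r$ is zero because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical datasets.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])           # y rises with x
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])           # y falls as x rises
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])     # y = x ** 2

print(r_positive, r_negative, r_curved)
```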

Sampling Methods

Techniques for selecting a subset of individuals from a population to estimate characteristics of the whole group. Common methods include simple random sampling, stratified sampling, cluster sampling, and systematic sampling. Proper sampling is essential for making valid inferences and avoiding bias.
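The four methods differ mainly in how the random choice is structured. The sketch below shows one possible version of each using the standard `random` module; the population of 100 numbered units, the two strata, and the cluster size are all hypothetical.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical population of 100 unit IDs

# Simple random sampling: every unit has the same chance of selection.
simple = random.sample(population, k=10)

# Systematic sampling: random start, then every step-th unit.
step = len(population) // 10
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: draw separately within each stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [unit for group in strata.values()
              for unit in random.sample(group, k=5)]

# Cluster sampling: pick whole clusters at random and keep every unit in them.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen_clusters = random.sample(clusters, k=2)
cluster_sample = [unit for cluster in chosen_clusters for unit in cluster]

print(simple)
print(systematic)
print(stratified)
print(cluster_sample)
```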

Bayesian Statistics

An approach to statistics that incorporates prior knowledge or beliefs along with observed data to update the probability of a hypothesis. Using Bayes' theorem, the posterior probability is calculated by combining the prior probability with the likelihood of the observed data. This framework is especially useful when prior information is available or sample sizes are small.
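A common worked example of Bayes' theorem is a diagnostic test for a rare condition. The sketch below assumes a two-hypothesis setting (condition present or absent); the function name and all three input rates are hypothetical numbers chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(hypothesis | positive result) via Bayes' theorem.

    prior:               P(hypothesis) before seeing the data
    likelihood:          P(positive | hypothesis true)
    false_positive_rate: P(positive | hypothesis false)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical rates: 1% prevalence, 95% sensitivity, 5% false positives.
after_one = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)

# Updating again: the first posterior becomes the prior for a second positive result.
after_two = posterior(prior=after_one, likelihood=0.95, false_positive_rate=0.05)

print("after one positive:", round(after_one, 3))
print("after two positives:", round(after_two, 3))
```

Even with an accurate test, a single positive result leaves the posterior fairly low because the prior is so small; a second positive result raises it substantially.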

Key Terms at a Glance

Alternative Hypothesis: The hypothesis that contradicts the null hypothesis, typically representing the researcher's claim that an effect, difference, or relationship exists in the population.

ANOVA: Analysis of Variance; a statistical method that tests whether the means of three or more groups are significantly different by comparing between-group and within-group variability using the F-statistic.

Bayesian Inference: A method of statistical inference that uses Bayes' theorem to update the probability of a hypothesis as new data become available, combining prior beliefs with observed evidence.

Bias: A systematic error in data collection, analysis, or interpretation that causes results to deviate from the true population values. Common forms include selection bias, measurement bias, and confirmation bias.

Central Limit Theorem: A fundamental theorem stating that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the population has finite variance.

Chi-Square Test: A non-parametric test used to assess the association between categorical variables or to compare observed frequencies with expected frequencies under a specified hypothesis.

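Confidence Interval: A range of values, computed from sample data, that is constructed to contain the true population parameter at a specified confidence level (e.g., 95%). The level describes the long-run share of such intervals that capture the parameter, not the probability that any single interval does.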

Correlation Coefficient: A numerical measure of the strength and direction of the linear relationship between two variables, most commonly the Pearson coefficient ($r$), which ranges from $-1$ to $+1$.

Degrees of Freedom: The number of independent values in a statistical calculation that are free to vary. Degrees of freedom affect the shape of test statistic distributions such as the t-distribution and chi-square distribution.

Effect Size: A quantitative measure of the magnitude of a phenomenon or the strength of a relationship. Common measures include Cohen's $d$, Pearson's $r$, and eta-squared. Unlike p-values, effect sizes convey practical significance.

Histogram: A graphical representation of the distribution of continuous data, where the data are divided into bins and the height of each bar represents the frequency or relative frequency of observations in that bin.

Hypothesis Testing: A formal statistical procedure for making decisions about population parameters by evaluating sample evidence against a null hypothesis, using test statistics and p-values.

Interquartile Range: The difference between the third quartile (75th percentile) and the first quartile (25th percentile), representing the spread of the middle 50% of the data. It is resistant to outliers.

Mean: The arithmetic average of a set of values, calculated by summing all values and dividing by the number of observations. It is the most widely used measure of central tendency.

Median: The middle value in an ordered dataset. For an even number of observations, the median is the average of the two central values. It is robust to extreme values and skewed data.
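The interquartile range defined above is often used as a simple outlier screen. The sketch below applies the common 1.5 × IQR convention (Tukey's fences), which is a widely used rule of thumb rather than something this sheet defines. The data are made up, and the quartile values depend on the interpolation method; here the `inclusive` method of `statistics.quantiles` is an arbitrary choice.

```python
import statistics

# Hypothetical data with one suspiciously large value.
data = [10, 12, 13, 14, 15, 16, 18, 19, 45]

# Quartiles: cut points at the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's fences: values beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q3:", q1, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("flagged values:", outliers)
```

A flagged value is only a candidate for closer inspection; whether it is an error or a genuine extreme observation is a separate judgment.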
