Data Analytics Glossary
25 essential terms — because precise language is the foundation of clear thinking in Data Analytics.
A/B Testing: A controlled experiment that compares two variants by randomly assigning subjects to each group, used to determine which version performs better on a defined metric.
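One common way to score an A/B test on a conversion metric is a two-proportion z-test. A minimal sketch using only the Python standard library (the function name and the conversion counts are illustrative):

```python
from statistics import NormalDist

def ab_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test comparing conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided p-value

# Hypothetical data: variant B converts 120/1000 vs. A's 90/1000
p = ab_test_pvalue(90, 1000, 120, 1000)
print(f"p-value: {p:.4f}")
```

A small p-value here would suggest the difference between variants is unlikely to be due to random assignment alone.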
Business Intelligence (BI): The strategies, technologies, and tools used to collect, integrate, analyze, and present business data to support better decision-making, typically through dashboards and reports.
Cohort Analysis: An analytical technique that groups subjects by a shared characteristic or time period and tracks their behavior over time to identify trends and lifecycle patterns.
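A minimal sketch of grouping users into signup-month cohorts and counting how many remain active in each later month (the event records are invented):

```python
from collections import defaultdict

# (user_id, signup_month, activity_month) events -- illustrative data
events = [
    (1, "2024-01", "2024-01"), (1, "2024-01", "2024-02"),
    (2, "2024-01", "2024-01"),
    (3, "2024-02", "2024-02"), (3, "2024-02", "2024-03"),
]

# Group users by signup-month cohort, then by the month they were active
cohorts = defaultdict(lambda: defaultdict(set))
for user, signup, active in events:
    cohorts[signup][active].add(user)

for signup in sorted(cohorts):
    counts = {month: len(users) for month, users in sorted(cohorts[signup].items())}
    print(signup, counts)
```

Reading each cohort's counts across months yields the familiar retention triangle: the January cohort starts with 2 active users and retains 1 the next month.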
Data Governance: The framework of policies, roles, processes, and standards that ensures organizational data is managed consistently, securely, and in compliance with applicable regulations.
Data Imputation: The process of replacing missing data values with substituted values using methods such as mean replacement, interpolation, k-nearest neighbors, or model-based approaches.
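The simplest of these methods, mean replacement, can be sketched in a few lines (function name and sample values are illustrative):

```python
def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(mean_impute([10, None, 14, None, 12]))  # -> [10, 12.0, 14, 12.0, 12]
```

Mean replacement preserves the column average but shrinks its variance, which is why interpolation or model-based methods are often preferred for downstream modeling.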
Data Lake: A storage system that holds vast amounts of raw data in its native format until needed, supporting schema-on-read and accommodating structured, semi-structured, and unstructured data.
Data Lineage: The documentation of data's origins, movements, and transformations throughout its lifecycle within an organization's systems and processes.
Data Visualization: The graphical representation of data through charts, graphs, maps, and dashboards to make complex information accessible and to reveal patterns, trends, and outliers.
Data Warehouse: A centralized repository that stores structured, processed data from multiple sources, optimized for analytical querying and reporting using a schema-on-write approach.
Descriptive Analytics: The tier of analytics that summarizes historical data using aggregation, visualization, and reporting to answer the question 'What happened?'
Dimensionality Reduction: Techniques that reduce the number of variables in a dataset while preserving as much information as possible, such as PCA, to simplify models and mitigate the curse of dimensionality.
ETL (Extract, Transform, Load): A data integration process that extracts data from sources, transforms it into a clean and consistent format, and loads it into a target data store.
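The three stages can be sketched end-to-end with the standard library. This is a toy pipeline, assuming an in-memory CSV stands in for the source system and SQLite for the target store:

```python
import csv
import io
import sqlite3

# Extract: read raw rows (an in-memory CSV standing in for a source file)
raw = io.StringIO("name,amount\n alice ,100\nBOB,200\n")
rows = list(csv.DictReader(raw))

# Transform: trim whitespace, normalize case, cast types
clean = [{"name": r["name"].strip().title(), "amount": int(r["amount"])}
         for r in rows]

# Load: write the cleaned rows into a target table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (:name, :amount)", clean)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

Real pipelines add error handling, incremental loads, and logging, but the extract/transform/load separation is the same.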
Feature Engineering: The process of using domain knowledge to create, select, or transform input variables from raw data to improve the performance of analytical or machine learning models.
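For example, raw order records might be turned into model inputs like revenue and a weekend flag. A small sketch (the field names and records are invented):

```python
from datetime import date

# Illustrative raw order records
orders = [
    {"order_date": date(2024, 1, 6), "price": 40.0, "quantity": 3},
    {"order_date": date(2024, 1, 9), "price": 25.0, "quantity": 2},
]

def engineer(order):
    """Derive model inputs from raw fields using simple domain knowledge."""
    return {
        "revenue": order["price"] * order["quantity"],     # interaction feature
        "is_weekend": order["order_date"].weekday() >= 5,  # calendar feature
    }

features = [engineer(o) for o in orders]
print(features)
```

Neither derived column exists in the raw data, yet both often carry more predictive signal than the original fields.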
KPI (Key Performance Indicator): A quantifiable metric used to evaluate how effectively an organization or process is achieving its strategic and operational objectives.
Null Hypothesis: The default assumption in hypothesis testing that there is no effect or no difference between groups. Statistical tests attempt to gather evidence against the null hypothesis.
OLAP (Online Analytical Processing): A category of systems designed for complex, multi-dimensional queries over large historical datasets, supporting operations like slicing, dicing, and drill-down.
Outlier: A data point that differs significantly from other observations in a dataset. Outliers may indicate measurement errors, data entry mistakes, or genuinely extreme values that require investigation.
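One standard detection rule is Tukey's fences, which flags points more than 1.5 interquartile ranges outside the middle half of the data. A stdlib-only sketch (function name and sample values are illustrative):

```python
from statistics import quantiles

def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))
```

Whether a flagged point is an error or a legitimate extreme still requires domain judgment; the rule only nominates candidates.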
P-value: The probability of observing results at least as extreme as the data, assuming the null hypothesis is true. Lower p-values provide stronger evidence against the null hypothesis.
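This definition can be made concrete with a permutation test: shuffle the group labels many times and count how often the shuffled mean difference is at least as extreme as the observed one. A sketch with invented data (the function name and sample values are illustrative):

```python
import random
from statistics import mean

def permutation_pvalue(a, b, n_perm=5000, seed=0):
    """Estimate the p-value of the observed mean difference by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:  # "at least as extreme" under the null
            hits += 1
    return hits / n_perm

p = permutation_pvalue([2.1, 2.4, 2.3, 2.2], [3.1, 3.0, 3.3, 3.2])
print(f"p is approximately {p:.3f}")
```

Because the two groups barely overlap, very few label shufflings reproduce a gap that large, so the estimated p-value is small.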
Predictive Analytics: The use of statistical models and machine learning on historical data to forecast future outcomes and probabilities.
Prescriptive Analytics: The most advanced analytics tier, which recommends specific actions by combining predictive models with optimization and simulation techniques.
Regression Analysis: A statistical method for estimating relationships between a dependent variable and one or more independent variables, used for prediction and understanding variable influence.
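The simplest case, ordinary least squares with one independent variable, fits a line y = slope*x + intercept. A minimal sketch (function name and data points are illustrative):

```python
def least_squares(xs, ys):
    """Fit y = slope*x + intercept by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # -> 2.0 1.0
```

Here the data lie exactly on y = 2x + 1, so the fit recovers the true coefficients; with noisy data the same formulas return the best-fitting line in the least-squares sense.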
SQL (Structured Query Language): The standard programming language for managing and querying relational databases, widely used for data retrieval, manipulation, and reporting in analytics.
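A typical analytical SQL query groups and aggregates rows. The following sketch runs against an in-memory SQLite database via Python's sqlite3 module (table and column names are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (region TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("east", 100.0), ("west", 50.0), ("east", 75.0)])

# Aggregate revenue per region, largest first
rows = db.execute("""
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
""").fetchall()
print(rows)
```

The SELECT/GROUP BY/ORDER BY pattern shown here is the workhorse of reporting queries across all relational databases.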
Star Schema: A data warehouse modeling approach featuring a central fact table linked to surrounding dimension tables, resembling a star and optimized for analytical queries.
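A toy star schema in SQLite shows the shape: a fact table of measures joined to dimension tables that describe the who, what, and when (all table names and values are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Dimension tables hold descriptive attributes; the fact table holds measures.
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO dim_date    VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales  VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 40.0);
""")

# A typical analytical query: join the fact to its dimensions and aggregate
rows = db.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(rows)
```

Every analytical question becomes the same pattern: join the fact table out to whichever dimensions the question slices by, then aggregate.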
Statistical Significance: A determination that an observed result is unlikely to have occurred by chance, typically assessed using a p-value threshold of 0.05.
Survivorship Bias: A logical error that occurs when analysis focuses only on entities that passed a selection process, ignoring those that did not, leading to overly optimistic or skewed conclusions.
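A classic illustration is fund performance: averaging returns only over funds that still exist overstates how well the typical fund did. A sketch with invented numbers:

```python
from statistics import mean

# Returns (%) for all funds launched in a year -- illustrative data
funds = [("A", 8.0), ("B", -12.0), ("C", 10.0), ("D", -15.0), ("E", 9.0)]
closed = {"B", "D"}  # funds that shut down and vanish from today's listings

survivors_only = mean(r for name, r in funds if name not in closed)
all_funds = mean(r for name, r in funds)
print(survivors_only, all_funds)  # survivors look far better than the full cohort
```

Because the losers were removed from the sample, the survivors-only average (9.0%) is dramatically rosier than the true cohort average (0.0%).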