Data Engineering Glossary
25 essential terms — because precise language is the foundation of clear thinking in Data Engineering.
Apache Airflow: An open-source platform for programmatically authoring, scheduling, and monitoring data workflows as directed acyclic graphs (DAGs).
Apache Avro: A row-based data serialization framework with compact binary encoding and built-in support for schema evolution.
Apache Flink: A distributed stream processing framework designed for stateful computations over unbounded and bounded data streams.
Apache Kafka: A distributed event streaming platform for high-throughput, fault-tolerant, real-time data pipelines and streaming applications.
Apache Parquet: A columnar storage file format optimized for analytical workloads, with efficient compression and encoding schemes.
Apache Spark: A unified analytics engine for large-scale data processing, supporting batch, streaming, SQL, machine learning, and graph computations.
Batch Processing: A data processing model in which data is collected over time and processed as a group at scheduled intervals.
Change Data Capture (CDC): A technique for detecting and propagating data changes from a source database to downstream systems in near-real time.
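A toy illustration of the idea behind change data capture (the function name and data are hypothetical; production tools such as Debezium instead tail the database's transaction log rather than diffing snapshots):

```python
def diff_snapshots(before, after):
    """Compare two {primary_key: row} snapshots and emit change events.

    A naive snapshot diff; log-based CDC avoids rescanning the table by
    reading changes directly from the database's write-ahead log.
    """
    events = []
    for key, row in after.items():
        if key not in before:
            events.append(("insert", key, row))
        elif before[key] != row:
            events.append(("update", key, row))
    for key in before:
        if key not in after:
            events.append(("delete", key, None))
    return events

old = {1: {"name": "Ada"}, 2: {"name": "Alan"}}
new = {1: {"name": "Ada Lovelace"}, 3: {"name": "Grace"}}
print(diff_snapshots(old, new))
# [('update', 1, {'name': 'Ada Lovelace'}), ('insert', 3, {'name': 'Grace'}), ('delete', 2, None)]
```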
Data Catalog: A metadata management tool providing a searchable inventory of data assets with descriptions, ownership, lineage, and quality metrics.
Data Governance: The framework of policies, standards, and processes that ensures data quality, security, privacy, and regulatory compliance.
Data Lake: A centralized storage repository that holds raw data in its native format, supporting schema-on-read access patterns.
Data Lakehouse: An architecture combining the flexibility of data lakes with the performance and management features of data warehouses.
Data Lineage: The tracking of data's origin, movement, and transformations across systems throughout its lifecycle.
Data Modeling: The process of defining how data is structured, related, and stored, using approaches such as star schema, snowflake schema, or Data Vault.
Data Observability: The ability to monitor and understand the health of data systems, encompassing freshness, volume, schema, distribution, and lineage.
Data Pipeline: An automated sequence of processing steps that moves and transforms data from sources to destinations.
Data Warehouse: A centralized, schema-enforced analytical data store optimized for complex queries and reporting on structured data.
dbt (data build tool): An open-source command-line tool that lets analytics engineers define SQL-based transformations that run inside a data warehouse.
ELT (Extract, Load, Transform): A pattern in which raw data is loaded into the target system first and then transformed using the target's own compute resources.
ETL (Extract, Transform, Load): A pattern in which data is extracted from sources, transformed in a staging area, and then loaded into a target system.
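The difference between the two patterns is only the ordering of the steps. A minimal sketch (the `extract`, `transform`, and `load` helpers are hypothetical stand-ins for real connectors and warehouse compute):

```python
def extract():
    # Stand-in for pulling raw rows from a source system.
    return [{"amount": "10.5"}, {"amount": "4.0"}]

def transform(rows):
    # Stand-in for cleaning/typing the data.
    return [{"amount": float(r["amount"])} for r in rows]

def load(rows, target):
    # Stand-in for writing rows into the target store.
    target.extend(rows)

# ETL: transform in a staging step, then load the cleaned rows.
warehouse_etl = []
load(transform(extract()), warehouse_etl)

# ELT: load raw rows first, then transform using the target's compute.
warehouse_elt = []
load(extract(), warehouse_elt)
warehouse_elt[:] = transform(warehouse_elt)

assert warehouse_etl == warehouse_elt  # same result, different ordering
```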
Idempotency: The property of an operation that produces the same result whether executed once or many times, making retries safe.
Partitioning: Dividing data into smaller segments based on a key to improve query performance and manageability.
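A small sketch of key-based partitioning (the function name and records are hypothetical), mirroring the common pattern of laying out a table as one directory per date so queries can skip irrelevant partitions:

```python
from collections import defaultdict

def partition_by(records, key):
    """Group records into partitions keyed by a column value (e.g. event date)."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)

events = [
    {"event_date": "2024-01-01", "user": "a"},
    {"event_date": "2024-01-02", "user": "b"},
    {"event_date": "2024-01-01", "user": "c"},
]
parts = partition_by(events, "event_date")
# A query filtered to 2024-01-01 now only scans that partition's two rows.
print({k: len(v) for k, v in sorted(parts.items())})
# {'2024-01-01': 2, '2024-01-02': 1}
```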
Schema Evolution: The ability to modify a data schema over time without breaking existing consumers, supporting forward and backward compatibility.
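A simplified illustration of backward compatibility (the schema and field names are invented for this example): a hypothetical v2 schema adds a field with a default value, so records written under v1 remain readable — the same mechanism Avro uses when a new field carries a default:

```python
# Hypothetical v2 schema: adds "country" with a default value.
V2_DEFAULTS = {"country": "unknown"}

def read_with_schema(row, defaults):
    """Fill fields missing from older records using the newer schema's defaults."""
    return {**defaults, **row}

v1_row = {"id": 7, "name": "Ada"}  # written before "country" existed
print(read_with_schema(v1_row, V2_DEFAULTS))
```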
Schema Registry: A service that stores and manages schemas for streaming data, enforcing compatibility rules between data producers and consumers.
Stream Processing: A data processing model in which records are processed continuously as they arrive, enabling real-time or near-real-time analytics.
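In contrast to batch processing, a stream processor maintains state and emits results per record rather than waiting for the whole dataset. A minimal sketch using generators (the source is a stand-in for an unbounded stream such as a Kafka topic):

```python
def stream_source():
    """Stand-in for an unbounded source; in practice this never terminates."""
    for value in [3, 1, 4, 1, 5]:
        yield value

def running_sum(stream):
    """Stateful operator: update and emit the aggregate as each record arrives."""
    total = 0
    for value in stream:
        total += value
        yield total

print(list(running_sum(stream_source())))  # [3, 4, 8, 9, 14]
```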