Information Retrieval Glossary

25 essential terms — because precise language is the foundation of clear thinking in Information Retrieval.


BM25

Best Matching 25, a probabilistic ranking function that incorporates term frequency saturation and document length normalization.

Related: TF-IDF, Okapi BM25, probabilistic retrieval
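The two ingredients named above, saturation and length normalization, can be seen in a minimal sketch of the scoring formula. The function name `bm25_scores`, the toy documents, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration, and the IDF shown is only one of several formulations in use.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # A non-negative IDF variant; other formulations exist.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            # Saturation: the tf factor approaches (k1 + 1) as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]
scores = bm25_scores(["cat"], docs)
```

In this toy collection the second document outranks the first because both contain "cat" once but the second is shorter; the third scores zero because it only contains the unstemmed form "cats".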

Boolean Retrieval

A retrieval model in which queries are expressed as Boolean combinations of terms (AND, OR, NOT) and each document either matches the query or does not.

Related: exact match, Boolean operators, set theory
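Because matching is all-or-nothing, Boolean operators map directly onto set operations over the documents containing each term. The sketch below assumes a tiny in-memory collection and a hypothetical helper named `postings`; it is an illustration, not a production query evaluator.

```python
docs = {
    1: "information retrieval with inverted indexes",
    2: "boolean retrieval uses set operations",
    3: "ranking documents with probabilistic models",
}

# Map each term to the set of document IDs that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

def postings(term):
    """Return the set of documents containing term (empty if unseen)."""
    return index.get(term, set())

both = postings("retrieval") & postings("inverted")      # retrieval AND inverted
either = postings("boolean") | postings("ranking")       # boolean OR ranking
and_not = postings("retrieval") - postings("boolean")    # retrieval AND NOT boolean
negated = all_ids - postings("retrieval")                # NOT retrieval
```

Note that a bare NOT needs the full set of document IDs, which is one reason negation is usually combined with another operand in practice.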

Cosine Similarity

A similarity measure between two vectors computed as the cosine of the angle between them, commonly used to compare document and query vectors.

Related: vector space model, dot product, similarity measure
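The measure is the dot product divided by the product of the vector lengths. A minimal sketch follows; returning 0.0 for an all-zero vector is a convention assumed here, since the cosine is undefined in that case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)

parallel = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Vectors pointing in the same direction score 1 regardless of length, which is why the measure is insensitive to document length when raw term counts are used.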

Cranfield Paradigm

The standard experimental methodology for IR evaluation, based on a test collection of documents, a set of queries, and relevance judgments.

Related: TREC, test collection, Cyril Cleverdon

Dense Retrieval

An approach to retrieval that uses learned dense vector embeddings, rather than sparse term-based representations, to match queries and documents semantically.

Related: neural IR, embeddings, dual encoder

Information Retrieval

The science of searching for and obtaining relevant information from large data collections, encompassing the algorithms and systems behind search engines and digital libraries.

Related: search engine, document retrieval, text mining

Inverted Index

A data structure that maps each term to the documents, and optionally the positions, where it occurs, enabling efficient full-text search.

Related: posting list, index, tokenization
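A small positional index makes the mapping concrete. This is a sketch under simple assumptions: whitespace tokenization, lowercasing, and a hypothetical function name `build_positional_index`; real indexes also compress and sort their postings.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id -> [positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)

to_postings = {doc_id: pos for doc_id, pos in index["to"].items()}
be_postings = {doc_id: pos for doc_id, pos in index["be"].items()}
```

Looking up a term is a single dictionary access, and the stored positions are what make phrase and proximity queries possible.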

Language Model

A probabilistic model that estimates the likelihood of a sequence of words, used in IR to rank documents by the probability that each document's model generates the query.

Related: probabilistic retrieval, smoothing, query likelihood
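Query-likelihood ranking can be sketched with a unigram model and Jelinek-Mercer smoothing, one of several smoothing methods. The function name, the toy documents, and the mixing weight `lam=0.5` (the share given to the collection model) are assumptions for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log P(query | doc) under a smoothed unigram document model."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    log_p = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)
        p_coll = coll_tf[term] / len(collection)
        # Interpolate document and collection estimates.
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0:
            return float("-inf")  # term unseen in the whole collection
        log_p += math.log(p)
    return log_p

docs = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
]
collection = [term for d in docs for term in d]
scores = [query_log_likelihood(["dog", "cat"], d, collection) for d in docs]
```

Smoothing is what keeps the first document from receiving zero probability even though it never mentions "dog"; it still ranks below the second document, which contains both query terms.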

Lemmatization

Reducing words to their dictionary base form (lemma) using linguistic analysis. It is typically more accurate than stemming.

Related: stemming, morphological analysis, part-of-speech tagging

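Mean Average Precision (MAP)

An evaluation metric that, for each query, averages the precision values at the rank of each relevant document, and then takes the mean of those averages across a set of queries.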

Related: precision, recall, NDCG
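The two levels of averaging are easier to see in code. This sketch assumes binary relevance, represents each query as a hypothetical `(ranking, relevant_set)` pair, and divides by the total number of relevant documents so that unretrieved ones count as zero precision.

```python
def average_precision(ranking, relevant):
    """Average of precision@rank over the ranks of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),              # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```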

NDCG

Normalized Discounted Cumulative Gain, an evaluation metric for ranked retrieval that supports graded relevance judgments.

Related: DCG, MAP, evaluation metrics
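A minimal sketch of the computation follows. It assumes linear gains with a log2 rank discount (an exponential gain of 2^g - 1 is also common) and, as a simplification, derives the ideal ordering from the graded labels of the retrieved list only rather than from all judged documents.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a list of graded relevance labels."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    """DCG at cutoff k divided by the DCG of the ideal (sorted) ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1, 0])
reversed_order = ndcg([0, 1, 2, 3])
```

A ranking already sorted by relevance scores 1.0, and placing highly relevant documents lower in the list reduces the score.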

PageRank

A link analysis algorithm that assigns importance scores to web pages based on the quantity and quality of incoming hyperlinks.

Related: HITS, web graph, link analysis
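The scores can be approximated by power iteration, sketched below. The graph, the damping factor of 0.85, the fixed iteration count, and the handling of pages without outlinks are assumptions for illustration; the sketch also assumes every link target appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a dict of page -> list of outlinked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Page "c" ends up with the highest score because three pages link to it, while "d", which has no incoming links, keeps only the baseline share.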

Posting List

The list of documents (and optionally positions) associated with a particular term in an inverted index.

Related: inverted index, term, document frequency

Precision

The proportion of retrieved documents that are relevant to the user's query.

Related: recall, F1 score, false positive

Query Expansion

The process of adding related terms to a query to improve recall by bridging vocabulary mismatches.

Related: relevance feedback, pseudo-relevance feedback, thesaurus

Recall

The proportion of all relevant documents in the collection that are successfully retrieved.

Related: precision, F1 score, false negative
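Precision and recall share a numerator (the relevant documents that were retrieved) and differ only in the denominator. A minimal sketch, assuming binary relevance and set-based (unranked) evaluation; the function name and document IDs are illustrative.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and their harmonic mean (F1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 documents retrieved, 3 relevant in the collection, 2 in common.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here precision is 2/4 and recall is 2/3: the two false positives ("d2", "d4") lower precision, and the one false negative ("d5") lowers recall.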

Relevance

The degree to which a retrieved document satisfies the user's information need. Relevance can be judged as binary or graded.

Related: relevance judgment, NDCG, topical relevance

Relevance Feedback

Using explicit or implicit user judgments on retrieved documents to iteratively refine the query and improve results.

Related: Rocchio algorithm, pseudo-relevance feedback, implicit feedback

Rocchio Algorithm

A classic relevance feedback method that adjusts the query vector toward relevant documents and away from non-relevant ones in the vector space model.

Related: relevance feedback, vector space model, query reformulation
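The adjustment is a weighted combination of the original query with the centroids of the judged documents. The sketch below uses plain lists as vectors; the weights `alpha=1.0`, `beta=0.75`, and `gamma=0.15` are commonly quoted defaults taken here as assumptions, and clipping negative weights to zero is a frequent but optional convention.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector from judged document vectors."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(component) / len(vectors) for component in zip(*vectors)]

    rel_centroid = centroid(relevant)
    non_centroid = centroid(non_relevant)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, rel_centroid, non_centroid)]
    # Negative term weights are clipped to zero (assumed convention).
    return [max(0.0, w) for w in updated]

new_query = rocchio(
    [1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    non_relevant=[[0.0, 0.0, 1.0]],
)
```

The second dimension gains weight because both relevant documents use that term, while the third dimension, present only in the non-relevant document, is pushed below zero and clipped.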

Stemming

The process of heuristically reducing words to a common stem, usually by stripping suffixes, to improve term matching across inflectional variants. The resulting stem is not necessarily a valid word.

Related: Porter Stemmer, lemmatization, morphology

Stop Words

Highly frequent function words (e.g., 'the', 'and', 'is') that are often removed during indexing to reduce noise and index size.

Related: text preprocessing, tokenization, noise reduction

TF-IDF

A term weighting scheme combining term frequency (how often a term appears in a document) and inverse document frequency (how rare the term is across the collection).

Related: term frequency, inverse document frequency, BM25
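One simple variant multiplies the raw term count by the natural log of N divided by the document frequency. Many other weightings exist (log-scaled tf, smoothed idf), so the formula, the function name `tf_idf`, and the toy documents below are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
weights = tf_idf(docs)
```

Under this variant "the" gets weight 0 because it appears in every document, while "dog", which occurs in only one, gets the largest weight.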

Tokenization

The process of breaking text into individual units (tokens), typically words or subwords, as a first step in text processing.

Related: text preprocessing, segmentation, n-gram
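A simple word-level tokenizer can be written with a regular expression. The pattern below is an assumption chosen for illustration (lowercase ASCII letters and digits, with internal apostrophes kept); it is not suited to subword tokenization or to languages without whitespace-delimited words.

```python
import re

def tokenize(text):
    """Lowercase text and split it into simple word tokens."""
    # Runs of letters/digits, allowing apostrophes inside a token.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

tokens = tokenize("Don't panic: IR systems index 1,000s of documents!")
```

The contraction survives as a single token, but "1,000s" is split at the comma into "1" and "000s", which shows how tokenization rules directly shape what can later be matched.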

TREC

Text REtrieval Conference, an annual NIST-organized evaluation campaign benchmarking IR systems on shared tasks.

Related: Cranfield paradigm, evaluation, NIST

Vector Space Model

A model representing documents and queries as vectors in term space, typically using cosine similarity between the vectors to estimate relevance.

Related: cosine similarity, TF-IDF, Gerard Salton