Retrieval-Augmented Generation (RAG): How Retrieval Algorithms Work

A comprehensive overview of retrieval algorithms in RAG.
By Boris Delovski • Updated on Sep 19, 2024

In earlier articles in this series, we covered the preprocessing of data for RAG systems and the processes for creating, populating, and updating a vector database that functions as the knowledge base for our RAG system. In this final article, we will conclude the series by focusing on the development of the component responsible for handling user queries. Specifically, we will explore how to retrieve the most relevant information from the knowledge base to generate accurate responses. While this step may appear to be the simplest part of building a RAG system, setting up the retrieval engine involves several critical considerations to ensure optimal results.

What Are Retrieval Engines

A retrieval engine, often called a retriever, is the part of a RAG (Retrieval-Augmented Generation) system that handles user queries. Its operation can be broken down into a few key steps.

First, the user submits a query for the RAG system to process. Since RAG systems leverage generative models, these queries are sometimes referred to as prompts.

Next, the query is encoded using the same model employed to create embeddings stored in a vector database. This encoding process enables an effective search of the vector database for information relevant to the query.

In the third step, the query's vector is compared to the vectors stored in the database to identify the most relevant information. Typically, these are text chunks with vectors most similar to the query's vector. The system retrieves the top n chunks (e.g., the top 5 most similar) to enrich the original query.

During the fourth step, the retrieved information is added to the original query. This step usually expands the query by incorporating relevant data from the selected chunks. It also signals that this information should be used to generate the final answer.

Finally, the enriched query is used as a prompt for a large language model (LLM), which processes it and generates a response. Thus, the final output is a combination of the retrieved information and the generative capabilities of the LLM.

What Is the Difference between Naive and Advanced Retrievers

A naive retriever is a basic model that simply compares the vector of the user’s query to those in a vector database and returns the text deemed most relevant. However, these types of retrievers are less common today. The reason behind this is that more modern approaches combine this basic method with additional techniques to ensure that the most relevant data is retrieved, even for complex or obscure queries.

For example, many modern retrievers use query rewriting, where a user’s query is rephrased to enhance the RAG system's ability to process and answer it. This rewriting is done using a generative AI model, which could be the same model used for the final answer or a smaller, specialized model. The complexity of the query rewriting can vary based on the application's needs, ranging from simple rephrasing to more advanced methods like Hypothetical Document Embeddings (HyDE), multi-querying, and step-back questions, among others. However, to keep things simple, we will skip query rewriting in this article.
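Although query rewriting is skipped in this article, a very simple version can be sketched in just a few lines. The sketch below reuses the same Ollama model that appears later in this article; the rewrite prompt itself is only an illustrative choice, not a prescribed one.

from langchain_community.llms.ollama import Ollama

def rewrite_query(query_text):
    # Ask a generative model to rephrase the user's query for better retrieval
    model = Ollama(model="llama3:8b")
    rewrite_prompt = (
        "Rewrite the following question so it is clear, specific, and well suited "
        "for searching a document database. Return only the rewritten question.\n\n"
        f"{query_text}"
    )
    return model.invoke(rewrite_prompt)

# The rewritten query would then be passed to the retriever instead of the original one
print(rewrite_query("how does that third man walking thing work in poker?"))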

The focus here falls instead on the second key improvement that distinguishes advanced retrievers from naive ones: enhancements to the retrieval process itself. Specifically, we will explore how to create a more accurate retrieval method, known as hybrid search, by developing an ensemble retriever.


What Are Ensemble Retrievers

In the previous article, we discussed creating vector databases and highlighted their importance as the knowledge base for RAG systems. While that remains true, relying on a vector database alone is not enough when building an advanced RAG system.

Vector databases are based on the principle that texts with similar meanings will produce similar embeddings when converted using the same embedding model. This type of search is often referred to as semantic search. To retrieve relevant information, we compare the query's embedding with the embeddings stored in the vector database. This is usually done using similarity measures like Euclidean distance or cosine similarity.
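To illustrate what such a comparison looks like, here is a minimal sketch that ranks a few stored embeddings against a query embedding by cosine similarity. The vectors are toy values chosen for illustration; real embeddings have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: the dot product of the vectors divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings standing in for the output of an embedding model
query_embedding = np.array([0.9, 0.1, 0.3])
stored_embeddings = {
    "chunk about poker rules": np.array([0.8, 0.2, 0.4]),
    "chunk about cooking": np.array([0.1, 0.9, 0.2]),
}

# Rank the stored chunks from most to least similar to the query
ranked = sorted(stored_embeddings.items(),
                key=lambda item: cosine_similarity(query_embedding, item[1]),
                reverse=True)
for name, embedding in ranked:
    print(name, round(cosine_similarity(query_embedding, embedding), 3))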

There is also the possibility of performing a so-called sparse search. This is similar to a keyword search; however, it uses more advanced algorithms for faster and more accurate word or phrase matching within the data.

While sparse search might seem less effective than semantic search, both methods have their strengths and weaknesses. Semantic search excels at finding information with similar meanings, even if synonyms or different spellings are used in the prompt. However, there are times when this is far from ideal. For instance, let us imagine that you want to specifically search for information about the movie "The Day After Tomorrow". Semantic search might return results related to the phrase "the day after tomorrow" in general, rather than the movie. In such cases, sparse search is more effective because it looks for the exact phrase as entered.

Since it can be difficult to determine which type of search is best for a given query, modern retrievers often use a hybrid search approach. Hybrid search combines semantic search with sparse search, creating an ensemble retriever that leverages the strengths of both methods. This typically leads to more accurate and relevant results for the user's query. 

Semantic search was already explained in the previous article, so there is no need to go over it again. Instead, we will cover the most popular sparse search algorithm, called BM25. We will also show how ensemble retrievers combine the results of the two searches using so-called Reciprocal Rank Fusion (RRF).

What Is the BM25 Algorithm

The BM25 algorithm, also known as Okapi BM25, is the most commonly used sparse search algorithm. It is a ranking algorithm used in information retrieval and search engines to assess the relevance of documents to a given query and rank them accordingly.

BM25 builds upon the traditional TF-IDF (Term Frequency-Inverse Document Frequency) model by incorporating several factors:

  • Term Frequency (TF): Measures how often a specific term appears in a document.
  • Inverse Document Frequency (IDF): Indicates how common or rare a term is across all documents in the dataset.
  • Document Length: Includes a mechanism to prevent longer documents from dominating shorter ones unfairly.
  • Term Saturation: Unlike simple TF-IDF, BM25 accounts for the diminishing impact of term frequency on relevance as the frequency increases.

The BM25 equation is quite complex, so it will not be elaborated further here. There is also no need to understand it in detail, because BM25 is already implemented in Langchain, which eliminates the need to code the search algorithm from scratch.
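For readers curious about what BM25 ranking looks like on its own, here is a minimal sketch using the rank_bm25 package, the library that Langchain's BM25Retriever relies on under the hood. The toy corpus and whitespace tokenization are simplifications for illustration.

from rank_bm25 import BM25Okapi

# A toy corpus of three "documents", tokenized naively by whitespace
corpus = [
    "the day after tomorrow is a disaster movie",
    "poker rules for five handed cash games",
    "weather forecast for tomorrow and the day after",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

# Build the BM25 index over the tokenized corpus
bm25 = BM25Okapi(tokenized_corpus)

# Score every document against the query and print them from most to least relevant
query_tokens = "the day after tomorrow movie".lower().split()
scores = bm25.get_scores(query_tokens)
for doc, score in sorted(zip(corpus, scores), key=lambda pair: pair[1], reverse=True):
    print(round(score, 3), doc)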

What Is Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion (RRF) is a method used to combine rankings from multiple retrieval models into a single, unified ranking, which makes it particularly useful in RAG systems. It extends the information retrieval step of the RAG system with a few additional steps.

RRF consists of the following steps:

  • User Query: The user submits a query.
  • Multiple Retrievers: The query is processed by multiple retrieval models (e.g., dense, sparse, hybrid).
  • Individual Rankings: Each model generates a ranking of relevant documents.
  • Re-ranking: These rankings are combined using the RRF formula.
  • Final Ranking: A single, unified ranking is produced based on the RRF scores.
  • Generation: The generative model uses the top-ranked documents to formulate the final answer.

In simple terms, when we retrieve, for example, the top 5 most relevant documents using semantic search and another 5 using BM25 search, the RRF algorithm will take these 10 documents and re-rank them. This re-ranking process creates a new list based on the importance of the information in each document for the query. From this new list, we can then select the top 5 documents to enrich our original query, combining the strengths of both search methods.
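To make the re-ranking concrete, here is a minimal sketch of the RRF formula itself: each document's fused score is the weighted sum of 1 / (k + rank) across every ranking it appears in, where k is a smoothing constant (60 is a common default). The chunk names below are placeholders for illustration.

def reciprocal_rank_fusion(rankings, weights=None, k=60):
    # Fuse several ranked lists of document IDs into one unified ranking
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes weight / (k + rank) to the document's fused score
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # A higher fused score means a higher position in the unified ranking
    return sorted(scores, key=scores.get, reverse=True)

# Example: top results from a semantic retriever and a BM25 retriever
semantic_ranking = ["chunk_a", "chunk_b", "chunk_c", "chunk_d", "chunk_e"]
bm25_ranking = ["chunk_c", "chunk_f", "chunk_a", "chunk_g", "chunk_h"]

print(reciprocal_rank_fusion([semantic_ranking, bm25_ranking], weights=[0.7, 0.3]))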

How to Create an Ensemble Retriever in Python with Langchain

Now that the concept and function of ensemble retrievers are clear, it is time to create one and integrate it into our RAG system. First, we need to create our semantic retriever, then the BM25 retriever, and finally combine the two into an ensemble retriever. To do so, we will create a function called run_query() that runs a query using a combination of semantic search and BM25 search. However, before creating that function, we need to define a template that will transform our original query into an enriched version to be passed to the LLM.

PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

Inside the run_query function, we will specify that the original query should be placed where {question} is, and the relevant information from the chunks should be inserted where {context} is. 

Now, let's create the run_query() function.

 

from langchain.docstore.document import Document
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain_community.llms.ollama import Ollama
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever


def run_query(query_text, database_path):
    """
    Run a query using a combination of semantic and BM25 search.

    Args:
        query_text (str): The query text to search for in the database.
        database_path (str): The path to the database.

    Returns:
        str: The generated response text from the model.
    """
    # Initialize the embedding function
    embedding_function = get_embeddings()

    # Initialize the Chroma database with the embeddings function
    db = Chroma(persist_directory=database_path, embedding_function=embedding_function)

    # Initialize the semantic retriever
    semantic_retriever = db.as_retriever(search_kwargs={"k": 10})

    # Perform the semantic search to get relevant documents
    retrieved_docs = semantic_retriever.get_relevant_documents(query_text)

    # Extract the sources from the retrieved documents
    sources = {doc.metadata['source'] for doc in retrieved_docs}

    # Construct the where clause using the $in operator
    where_clause = {"source": {"$in": list(sources)}}
    bm25_collection = db.get(where=where_clause, include=["documents"])
    bm25_chunks = [Document(page_content=text) for text in bm25_collection['documents']]

    # Initialize the keyword retriever (e.g., top 10 results)
    bm25_retriever = BM25Retriever.from_documents(bm25_chunks)
    bm25_retriever.k = 10

    # Combine the results of semantic retrieval and BM25 search
    ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, semantic_retriever], weights=[0.3, 0.7])

    # Retrieve the top-k relevant documents for the query text
    results = ensemble_retriever.invoke(query_text)

    # Create the enriched prompt by joining the content of the retrieved documents
    context_text = "\n\n---\n\n".join([doc.page_content for doc in results])
    # Initialize the prompt template and format it with the context and query
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)

    # Initialize the model and invoke it with the formatted prompt
    model = Ollama(model="llama3:8b")
    response_text = model.invoke(prompt)

    # Extract the sources from the retrieved documents
    sources = [doc.metadata.get("id") for doc in results if doc.metadata.get("id") is not None]
    # Format the response with the generated text and sources
    formatted_response = f"Response: {response_text}\nSources: {sources}"

    return formatted_response

Let’s break down the code in this function. The first thing we do is call the get_embeddings() function inside our run_query() function to initialize the embedding function.

# Initialize the embedding function
embedding_function = get_embeddings()

This function, which was already defined in the previous article, plays a pivotal role. It will convert the user query into an embedding, which we can then compare to the embeddings in our database to identify the most similar ones. Essentially, it is an integral part of our semantic retriever.

The semantic retriever will be defined next. Creating a semantic retriever with Langchain is not particularly intuitive. Instead of creating a separate retriever, we initialize our database and specify that we want to use it as a retriever using the as_retriever() function. The key parameter to define when creating this semantic retriever is the value of the search_kwargs argument. In my code, I set it to {"k": 10}. This ensures that when I convert my query into an embedding and use the database as the semantic retriever, I get back the 10 chunks with embeddings most similar to my query’s embedding.

# Initialize the Chroma database with the embeddings function
db = Chroma(persist_directory=database_path, embedding_function=embedding_function)

# Initialize the semantic retriever
semantic_retriever = db.as_retriever(search_kwargs={"k": 10})

Once this retriever is defined, we can get these 10 relevant chunks using the get_relevant_documents() method of the semantic retriever. Please note that this DOES NOT return the PDF files that were the original source of information, but rather the chunks we created earlier from those PDF files, which are stored in our database.

# Perform the semantic search to get relevant documents
retrieved_docs = semantic_retriever.get_relevant_documents(query_text)

This completes the setup for our semantic retriever. Next, we will shift our focus to running the BM25 retriever. To optimize efficiency, the BM25 retriever's search space will be limited. We will assume that our semantic retriever is effective enough to pinpoint which PDFs contain relevant information. As a result, we will only run the BM25 retriever on chunks from those identified files.

For example, let's say our database consists of chunks extracted from 10 different PDF files. After running the semantic retriever using a query about poker, the top 10 chunks returned are from just 3 of these files: “poker rules.pdf,” “game rules.pdf,” and “gambling games.pdf.” Instead of applying the BM25 algorithm to all chunks from the 10 PDF files, we will restrict it to the chunks from the 3 files identified by the semantic retriever. This approach narrows the algorithm's search space, ensuring the BM25 search focuses on relevant data. It also avoids processing chunks that are unlikely to be connected to our query. Thus, we use the semantic retriever not only to rank chunks by relevance but also to determine which chunks the BM25 algorithm should analyze.

To identify the PDF files associated with the chunks selected by the semantic retriever, we can examine the metadata of each chunk, specifically the "source" field. This field indicates the PDF file from which the chunk was extracted.

# Extract the sources from the retrieved documents
sources = {doc.metadata['source'] for doc in retrieved_docs}

Once the sources are identified, we can construct a simple where clause that allows us to retrieve all chunks from our database that belong to these identified sources. These chunks will be returned as a list of Document objects, which is the format the BM25 algorithm in Langchain expects as input.

# Construct the where clause using the $in operator
where_clause = {"source": {"$in": list(sources)}}
# Extract relevant chunks
bm25_collection = db.get(where=where_clause, include=["documents"])
bm25_chunks = [Document(page_content=text) for text in bm25_collection['documents']]

Finally, it is time to set up our BM25 retriever and run it on our data. We will then retrieve the top 10 chunks that the BM25 retriever identifies as the most relevant.

# Initialize the keyword retriever (e.g., top 10 results)
bm25_retriever = BM25Retriever.from_documents(bm25_chunks)
bm25_retriever.k = 10

Now that both of our retrievers are ready, we can combine their results by creating an ensemble retriever. Since each retriever returned 10 chunks it believes are most relevant, we have 20 chunks in total in contention. What the ensemble retriever is going to do is re-rank those chunks using the RRF algorithm to produce a unified ranking. 

While we could let the algorithm run as is, it is usually a good idea to assign weights to different algorithms. This means we can instruct the ensemble retriever to prioritize one algorithm over the other. Since semantic search tends to perform better in most scenarios than BM25, we will assign a weight of 0.3 to BM25 and 0.7 to the semantic search algorithm. This way, the ensemble retriever will favor chunks returned by the semantic retriever. However, it will still allow chunks from BM25 to appear in the top 10 of the unified ranking if they are highly relevant.

# Combine the results of semantic retrieval and BM25 search
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, semantic_retriever], weights=[0.3, 0.7])

# Retrieve the top-k relevant documents for the query text
results = ensemble_retriever.invoke(query_text)

The next step is to insert the top-ranked chunks into our query template. This ensures that the query sent to the LLM includes this extra information as a reference, helping the LLM provide a more accurate and improved response.

# Create the enriched prompt by joining the content of the retrieved documents
context_text = "\n\n---\n\n".join([doc.page_content for doc in results])
# Initialize the prompt template and format it with the context and query
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query_text)

The final step is to pass the enriched query to our LLM to obtain the answer. In this example, I have added extra logic to the code to ensure that, along with the answer to our question, we also receive information about the sources the model used. Specifically, we will see which chunks were referenced in generating the answer.

# Initialize the model and invoke it with the formatted prompt
model = Ollama(model="llama3:8b")
response_text = model.invoke(prompt)

# Extract the sources from the retrieved documents
sources = [doc.metadata.get("id") for doc in results if doc.metadata.get("id") is not None]
# Format the response with the generated text and sources
formatted_response = f"Response: {response_text}\nSources: {sources}"

return formatted_response

Now, with this function, we can pose a query to our LLM, and it will generate an answer based on the provided information. For example, since my database contains PDF files with details about various games, I asked the following question referencing the database location defined in the previous article of this series:

run_query("What is the Third Man Walking rule in Poker?", "CHROMA")

This is the answer received from my RAG system:

Response: According to the provided context, the "Third Man Walking" rule states that any Player who gets up from their seat in a cash game after two other Players are already away from the table will be considered the third man walking. This applies to five-handed or more games in a new or must-move game.

The RAG system also returned a list of sources, but I will not include them here because they are quite lengthy. Moreover, they would mean little to the reader, as they are simply file paths on my computer.

This article concludes our series on building a RAG system from the ground up. Throughout the series, we examined how RAGs operate, the key components involved, and how to construct an advanced RAG using primarily Langchain. As the field of LLMs rapidly evolves, new methods to improve RAG performance will likely emerge. However, this series provides a solid foundation for creating your own RAG system, which you can use to streamline your daily work.

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.