Introduction to RAG: Retrieval Augmented Generation

A comprehensive introduction to RAG systems.
By Boris Delovski • Updated on Aug 5, 2024

Large Language Models (LLMs) are currently a hot topic in the AI community, and for good reason. These models are among the first highly advanced AI tools that are accessible and easy to use for everyone, even users completely unfamiliar with Deep Learning. The prospect of using models like ChatGPT to handle numerous mundane and repetitive tasks is appealing to many. Unfortunately, while the idea is attractive, it rarely translates into a functional pipeline that effectively automates these tasks. This is primarily because of one major obstacle: hallucinations. Hallucinations are factually incorrect or nonsensical responses generated by an LLM.

LLMs are prone to these errors, especially when their responses are not carefully curated for accuracy. Therefore, to rely on the outputs of an LLM, we need a way to ensure that the information it provides is correct. This need led to the creation of Retrieval Augmented Generation (RAG) systems. These systems supplement the knowledge of an LLM with external information, which the model can then use when generating its responses. By grounding the model in this external source of information, we minimize the chances of it producing factually incorrect responses. This reliability allows us to integrate the LLM into a pipeline for automating certain tasks.

What Is Retrieval Augmented Generation (RAG)

RAG systems, as mentioned in the introduction, combine two processes: information retrieval and natural language generation. Their necessity arises from the way LLMs are trained. LLMs are trained on vast amounts of data from various sources, which makes them excellent general-purpose models. However, they struggle when asked about information not included in their training data. This includes recent developments or highly specific domain knowledge.

The issue of outdated information can be somewhat reduced by periodically fine-tuning the model with newly acquired data. However, there is still the problem of particularly specific information from niche domains, which often remains poorly represented. In other words, we cannot rely on the models to produce accurate information in these areas. RAG systems were developed precisely to address this problem.

The concept behind RAG systems is quite straightforward. We do not depend solely on the information the model learned during its initial training. Instead, we provide additional relevant information to enhance the model's responses. This way, the model can draw on this extra information when formulating its answers. This approach reduces the need to fine-tune models with niche data, a costly task for most individuals and even for many large companies. At the same time, it ensures that the model produces accurate responses, even when asked about unfamiliar topics.

How Does Retrieval Augmented Generation (RAG) Work at a High Level

As mentioned earlier, a RAG system is essentially a combination of two distinct algorithms: an information retrieval algorithm and a Large Language Model (LLM). The operation of a RAG system can be broken down into four steps:

1.    The user submits a query or question they want the system to answer.
2.    The information retrieval algorithm searches through a database of documents to gather information related to the user's query.
3.    The original query is combined with the collected information to create a more detailed prompt.
4.    The LLM receives this prompt and then generates an answer.

For example, let us say that we start with a query such as:

"Is green tea healthy?". 

This query will then be used by the information retrieval algorithm to collect information about green tea. For instance: green tea is rich in antioxidants such as catechins; research indicates that drinking it can improve brain function and increase fat burning; and green tea consumption has been linked with a reduced risk of certain cancers as well as improved cardiovascular health. This information will be combined with the original query, creating an elaborated version of it, such as:

"Considering that green tea contains antioxidants such as catechins, can improve brain function, increase fat burning, reduce the risk of certain cancers, and enhance cardiovascular health, is drinking green tea considered healthy??"

This enhanced prompt ensures that the LLM incorporates all the additional information when generating its answer. As a result, we receive a more accurate response and reduce the likelihood of LLM hallucinations.
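To make these four steps more concrete, here is a minimal Python sketch of how the augmented prompt could be assembled. The retrieval step is represented by a hard-coded list and the final LLM call is only simulated with a print statement, since both components are covered in more detail later in the article.

```python
# A minimal sketch of the four RAG steps. The hard-coded list below stands in
# for the information retrieval algorithm, and the final print stands in for a
# call to whichever LLM you choose.

def build_augmented_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Combine the user's query with the retrieved information (step 3)."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Step 1: the user submits a query.
query = "Is green tea healthy?"

# Step 2: the retrieval algorithm returns related chunks (hard-coded here).
retrieved_chunks = [
    "Green tea is rich in antioxidants such as catechins.",
    "Research indicates that drinking tea can improve brain function.",
]

# Step 3: build the more detailed prompt.
prompt = build_augmented_prompt(query, retrieved_chunks)

# Step 4: pass the prompt to the LLM (simulated here by printing it).
print(prompt)
```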

What Is the Information Retrieval Algorithm

The process of information retrieval goes beyond simply searching a database and returning all information related to the topic at hand. In practice, it is slightly more nuanced. To retrieve information relevant to our query, we need to:

•    divide the information in our documents into chunks
•    create embeddings for these chunks
•    use a ranking algorithm to select the chunks that are most closely related to our query 

What Is Chunking 

Documents stored in our database are usually divided by a chunking algorithm into multiple parts called chunks. This is done to prevent the information retrieval algorithm from returning an entire document from our dataset when we submit a query. We only need the part of the document that is most closely connected to our query. Many different chunking algorithms can be used to separate documents into chunks. These are the most popular ones:

•    fixed-length chunking 
•    sliding window chunking
•    semantic chunking
•    hierarchical chunking
•    text tiling

Fixed-length chunking splits documents into chunks of some predetermined size. For example, we can decide to take 256 tokens at a time and treat them as one chunk. This type of chunking is especially simple and fast. However, in most cases, this is not useful because it completely ignores contextual information. Moreover, at times it even separates data that should be kept together.
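As a rough illustration, a fixed-length chunker can be written in a few lines of Python. This sketch splits on whitespace-separated words instead of real model tokens, purely for simplicity:

```python
def fixed_length_chunks(text: str, chunk_size: int = 256) -> list[str]:
    """Split a document into chunks of a fixed number of tokens.

    For simplicity, 'tokens' here are whitespace-separated words; a real
    system would typically use the tokenizer of the embedding model.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```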

Sliding window chunking uses overlapping windows to create chunks, preserving more context at the boundaries. It balances context retention with manageable chunk sizes. It produces better results than fixed-length chunking. Yet, it is still not one of the better algorithms.
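A sliding window chunker looks almost the same, except that consecutive chunks share an overlapping region so that context at the boundaries is not lost. The window and overlap sizes below are arbitrary illustrative values:

```python
def sliding_window_chunks(text: str, window: int = 256, overlap: int = 64) -> list[str]:
    """Split a document into overlapping chunks to preserve context at the boundaries."""
    tokens = text.split()
    step = window - overlap  # how far the window moves each time
    return [
        " ".join(tokens[i:i + window])
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]
```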

Semantic chunking utilizes natural language processing techniques to split documents based on semantic units like sentences, paragraphs, or topics. This ensures chunks are contextually coherent. This is one of the better chunking algorithms because it ensures that context is preserved during the chunking process.
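One simple way to approximate semantic chunking is to split the text into sentences and then group whole sentences into chunks, so that no sentence is ever cut in half. The sketch below assumes the NLTK library is available for sentence segmentation:

```python
import nltk

# The punkt sentence tokenizer data must be downloaded once
# (newer NLTK versions may name this resource "punkt_tab").
nltk.download("punkt", quiet=True)

def semantic_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Group whole sentences into chunks so that no sentence is split in half."""
    chunks, current, current_len = [], [], 0
    for sentence in nltk.sent_tokenize(text):
        sentence_len = len(sentence.split())
        if current and current_len + sentence_len > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += sentence_len
    if current:
        chunks.append(" ".join(current))
    return chunks
```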

Hierarchical chunking combines different chunking strategies (e.g., first by paragraph, then by sentence) to maintain a hierarchy of information. It is highly effective for preserving document structure. Therefore, it is used quite often for chunking data.
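A minimal sketch of hierarchical chunking might first split by paragraph and only fall back to sentence-level splitting when a paragraph is too long. The sentence splitting here is deliberately crude, just to illustrate the idea:

```python
def hierarchical_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Chunk by paragraph first, then fall back to sentences for long paragraphs."""
    chunks = []
    for paragraph in text.split("\n\n"):          # first level: paragraphs
        if len(paragraph.split()) <= max_tokens:
            chunks.append(paragraph.strip())
        else:                                     # second level: sentences (crude split)
            for sentence in paragraph.replace("\n", " ").split(". "):
                if sentence.strip():
                    chunks.append(sentence.strip())
    return chunks
```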

Finally, text tiling algorithmically identifies topic boundaries within a document and splits it into chunks accordingly. This makes it an excellent choice for chunking documents, especially for topic-based retrieval.
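For example, NLTK ships with a TextTiling implementation. The sketch below assumes a reasonably long document stored in a hypothetical long_document.txt file, since TextTiling needs enough text (with blank lines between paragraphs) to detect topic shifts:

```python
import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords", quiet=True)  # TextTiling relies on the NLTK stopwords corpus

# Hypothetical input file; TextTiling may fail to find boundaries in very short texts.
with open("long_document.txt", encoding="utf-8") as f:
    long_document = f.read()

tiles = TextTilingTokenizer().tokenize(long_document)
print(f"Document split into {len(tiles)} topic-based chunks")
```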

In the end, the choice of chunking algorithm is up to the creator of the RAG system. However, in most situations, people opt for one of the latter three algorithms.

How to Create Embeddings

After our documents (and occasionally even our lengthy prompts) have been chunked, we need to convert the chunks of text into numerical representations. There are multiple ways of doing this. We usually opt for dense embeddings, although there are simpler approaches, like creating sparse embeddings using the TF-IDF method or its variants. In other words, we represent our chunks as multidimensional vectors of numerical values. To do this, we typically use Transformer-based neural network architectures such as BERT, RoBERTa, Sentence-BERT, etc. These models excel at capturing semantic meaning, enabling us to create high-quality embeddings suitable for representing both our data and our query.
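As an illustration, the sentence-transformers library provides a convenient way to create such dense embeddings with Sentence-BERT style models. The model name below is just one common choice and can be swapped for any other:

```python
# Assumes the sentence-transformers package is installed; the model name is
# only one common choice among many.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Green tea is rich in antioxidants such as catechins.",
    "Drinking tea can improve brain function and increase fat burning.",
]
query = "Is green tea healthy?"

chunk_embeddings = model.encode(chunks)   # one dense vector per chunk
query_embedding = model.encode(query)     # a single dense vector for the query

print(chunk_embeddings.shape)  # e.g. (2, 384) for this particular model
```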

How to Retrieve Information

Once we create embeddings, we can use a ranking algorithm to retrieve the most relevant chunks from our database for the given query. This process involves computing the cosine similarity between the embedding representing our query and the embeddings representing the chunks in our document database. By evaluating these similarities, we can identify and select the chunks with the highest similarity scores to enhance our original query.
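A minimal version of this ranking step can be written with nothing more than NumPy, assuming the query and chunk embeddings were created as shown earlier:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(query_embedding, chunk_embeddings, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    scores = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```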

There are certainly more advanced information retrieval algorithms, such as fusion algorithms like RRF (Reciprocal Rank Fusion), which combines the rankings produced by different retrieval models, and Cohere Rerank, which uses cross-attention mechanisms to understand the contextual relationship between the query and a document for improved ranking. However, these techniques are complex and beyond the scope of this article.

What Is the Large Language Model

This part involves fewer choices, making it simpler. Large Language Models (LLMs) are advanced Deep Learning algorithms of such magnitude that training one from scratch is usually only feasible for the largest companies. Therefore, you usually need to select one of the pre-trained models.

There are two main options. The first is to choose a proprietary model and pay for its usage. A prime example is the GPT family of models developed by OpenAI, on which ChatGPT is built. You can integrate your existing information retrieval pipeline directly with the model to generate responses. While this method may be simpler, it might not be financially viable if you need to process a large number of queries daily, as costs can escalate rapidly.
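As a rough sketch, assuming the openai Python package (v1-style client) is installed and an API key is configured in the environment, the generation step of a RAG pipeline could look like this; the model name is purely illustrative:

```python
# Assumes the openai package (v1-style client) is installed and the
# OPENAI_API_KEY environment variable is set. The model name is illustrative.
from openai import OpenAI

client = OpenAI()

def generate_answer(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any available chat model can be used here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_answer("Considering that green tea contains antioxidants, is it healthy?"))
```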

The second option is to use an open-source model. Numerous open-source models are available today, such as the LLaMa model, among others. Open-source models have their own set of advantages and disadvantages. The main advantage is the absence of a fixed cost based on query volume. Instead, costs depend on the computing resources utilized. Additionally, these models are easier to customize due to their transparency. The code for most proprietary LLMs is tightly guarded, limiting modifications.

However, open-source models generally do not perform as well as proprietary models right out of the box. For instance, OpenAI's GPT model will likely outperform any open-source LLM. The advantage of open-source models lies in the potential for fine-tuning to better suit your domain-specific data. Given adequate computing power, you will discover that open-source models, while possibly slightly behind proprietary ones, still deliver excellent results. Moreover, they can effectively serve as the text generation algorithm for your RAG system.
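With an open-source model, the same generation step can be sketched using the Hugging Face transformers library. The small model below is used purely for illustration; larger models such as the Llama family may require accepting a license on Hugging Face and considerably more hardware:

```python
# Assumes the transformers package is installed and there is enough memory
# for the chosen model; the model name is only an example of an open LLM.
from transformers import pipeline

generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompt = "Considering that green tea contains antioxidants such as catechins, is it healthy?"
output = generator(prompt, max_new_tokens=128)
print(output[0]["generated_text"])
```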

This article covered the fundamentals of Retrieval-Augmented Generation (RAG) systems. It explained what a RAG system is, how it functions, and its main components. The explanations were intentionally kept at a high level to avoid overloading the article with unnecessary details. The purpose of this article is to offer a quick overview of RAG systems so that you have the information to decide whether they would be beneficial for your needs. Should you conclude that a RAG system would be useful in your context, you can proceed to a more detailed exploration of the various algorithms discussed here. This will allow you to strategize and design the framework of your RAG system in detail.

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.