In a previous blog article, we explored the concept of Retrieval-Augmented Generation (RAG) systems, detailing their functionality and key components. As a prelude to our upcoming series on constructing a RAG system, this article lays out a comprehensive roadmap. This plan outlines the elements necessary for designing and assembling a fully functional, advanced system. By the end of this roadmap, you will have a clear understanding of the steps required to build and implement an effective RAG system.
In this series, we will focus on developing an advanced RAG system rather than a basic or naive version. Before we dive into the design process, we will therefore first explain the differences between naive and advanced RAG systems.
What Is the Difference Between Naive RAG Systems and Advanced RAG Systems
A typical, naive RAG system works in a simple way: when the user submits a query, we find information relevant to that query in our knowledge base and then create a new version of the query that includes this extra information.
In such a system, the role of the LLM shifts from being a primary source of information to functioning more as a summarizer. It uses its text-generation capabilities to create an answer based on the additional knowledge provided alongside the original query. The approach described above is what we call a naive RAG system. Such systems can indeed perform satisfactorily, but they sometimes encounter issues common to LLMs, the most significant being hallucinations.
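The naive flow can be summarized in a few lines of Python. This is only a conceptual sketch: `retrieve_relevant_chunks` and `llm_generate` are hypothetical placeholders for the components we design in the rest of this roadmap.

```python
def naive_rag_answer(query: str) -> str:
    # 1. Look up chunks from the knowledge base that are relevant to the query
    chunks = retrieve_relevant_chunks(query)  # hypothetical retrieval step

    # 2. Build an enriched query that wraps the retrieved context around the question
    context = "\n\n".join(chunks)
    enriched_query = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. The LLM acts as a summarizer over the provided context
    return llm_generate(enriched_query)  # hypothetical LLM call
```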
LLM hallucinations have already been discussed in a previous article on our blog. Nevertheless, to refresh your memory: they occur when an LLM generates information that is false or misleading. Essentially, these are fabricated facts that the model produces while trying to fulfill a user's request.
Hallucinations are a common challenge for LLMs, and since LLMs are a key part of RAG systems, they also affect RAG systems. Even when an LLM is provided with additional information from our knowledge base to supplement the query, it might still produce false information. This is a significant issue, as the primary goal of a RAG system is to avoid such inaccuracies.
This is where the distinction between naive RAG systems and advanced RAG systems becomes crucial. Advanced RAG systems are designed to minimize hallucinations. While it is impossible to guarantee that an LLM will never generate false data, it is possible to reduce the occurrence of such problems.
The essential ways to improve the results produced by RAG systems are largely unrelated to the LLMs themselves. Instead, they focus on other parts of the RAG system. While better prompts and prompt engineering can help reduce hallucinations in LLMs, this approach does not scale well over time, because new LLMs are continually being released and the prompt engineering techniques that worked for older models may not be effective for newer ones. Therefore, the more effective approach is to improve the other components of the RAG system.
Our RAG system design can therefore be broken down into the following sections:
- document processing and creating a knowledge base
- information retrieval
- enriched query processing
How to Process Documents and Create a Knowledge Base
When building a RAG system, the first decision we need to make is how to process documents and split their text into chunks. The first problem is that documents come in different formats. In an ideal world, all text would be stored in plain text files, which could be easily processed with Python; reading text data from such documents is trivial.
In practice, the situation is a little more complex. To build a robust knowledge base, we need to combine information from multiple sources and formats. This includes parsing different types of documents, such as PDF files, HTML files, etc.
The simplest approach to this problem is to use a different parser for each format. In Python, there is a plethora of libraries designed to process a specific document format. Ideally, though, we should look for a solution that can manage a variety of data types, which saves us from importing and using several different libraries. Because it offers a way to easily parse different types of data, we will use LangChain for this part of our RAG system. LangChain is also a great framework for working with LLMs in general.
This is a particularly powerful framework, designed for developers who want to create, work with, and manage LLM applications. LangChain enables users to easily parse a wide variety of formats:
- CSV files
- HTML files
- PDF files
- JSON files
- etc.
In the background, LangChain uses libraries specific to each format, but there is no need to worry about that: all users have to do is call the appropriate document loaders directly from LangChain.
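As an illustration, here is a minimal sketch of how documents in different formats could be loaded. It assumes the `langchain-community` package (and `pypdf` for PDF support) is installed, and the file paths are hypothetical examples; exact import paths may differ slightly between LangChain versions.

```python
from langchain_community.document_loaders import CSVLoader, PyPDFLoader, TextLoader

# Each loader parses one format and returns a list of LangChain Document objects
loaders = [
    PyPDFLoader("reports/annual_report.pdf"),  # hypothetical example files
    CSVLoader("data/products.csv"),
    TextLoader("notes/meeting_notes.txt"),
]

documents = []
for loader in loaders:
    documents.extend(loader.load())

print(f"Loaded {len(documents)} documents")
```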
LangChain will not only be used for parsing; we will also use it to split the processed documents into chunks. LangChain offers many text splitters that users can apply to their documents (a short sketch follows the list below):
- Recursive splitter
- HTML splitter
- Markdown splitter
- Character splitter
- Semantic splitter
- etc.
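As an example, the recursive splitter could be used roughly as follows. This sketch assumes the `langchain-text-splitters` package and reuses the `documents` list from the previous snippet; the chunk size and overlap values are illustrative, not recommendations.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split documents into overlapping chunks that fit comfortably into an embedding model
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # illustrative maximum characters per chunk
    chunk_overlap=200,  # illustrative overlap to preserve context across chunk boundaries
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
```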
After our text has been processed, we still need to convert our chunks of data into embeddings. We also need a vector database in which to store those embeddings.
To generate embeddings, we will use a pre-trained transformer-based embedding model, like the Nomic embedding model. This model will create embeddings for the document chunks that we will store in a vector database built with ChromaDB.
Chroma is a vector database that integrates with LangChain and LlamaIndex, two leading frameworks for developing LLM applications. Using Chroma will make it much easier to build our RAG system, given that we are already using LangChain for text processing.
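Putting the two together might look something like the sketch below. It assumes the Nomic embedding model is served locally through Ollama under the name `nomic-embed-text` and that the `langchain-community` and `chromadb` packages are installed; the model name and the persistence directory are assumptions for illustration.

```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Embedding model that turns each chunk into a vector (assumes Ollama is running locally)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Build a persistent Chroma vector store from the chunks created earlier
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="rag_chroma_db",  # hypothetical local directory
)
```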
What Is Information Retrieval
In our previous article, we introduced information retrieval algorithms and explored both simple and advanced methods for extracting information from a knowledge base. We mentioned a basic technique that involves finding embeddings in a vector database that closely match the query’s embedding. However, this naive approach is unlikely to achieve optimal results. For more detail, see our earlier articles:
- Retrieval-Augmented Generation (RAG): Preprocessing Data for Vector Databases
- Retrieval-Augmented Generation (RAG): How to Work with Vector Databases
To improve our retrieval process, we will integrate several techniques:
- query rewriting
- fusion retrieval and reranking
Query rewriting uses a Large Language Model (LLM) to rephrase the user’s query, which improves the retrieval process. Poorly constructed queries can lead to suboptimal model performance, so by using LangChain to rewrite queries, we can better align them with the model’s expectations and obtain better outcomes.
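A query rewriting step might look roughly like this sketch. It assumes a local Llama 3 model served through Ollama; the prompt wording, the model name, and the example question are assumptions for illustration.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOllama

rewrite_prompt = ChatPromptTemplate.from_template(
    "Rewrite the following question so that it is clear, specific, and well suited "
    "for retrieving documents from a knowledge base. Return only the rewritten question.\n\n"
    "Question: {question}"
)

llm = ChatOllama(model="llama3", temperature=0)  # assumed local model name
query_rewriter = rewrite_prompt | llm

# Turn a vague user question into a retrieval-friendly one
rewritten_query = query_rewriter.invoke({"question": "rag hallucinations why"}).content
```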
Fusion retrieval combines the naive similarity search between the query vector and the vectors in the database with a more traditional keyword-based search. Rerankers, which reorder the retrieved results to improve relevance, are an important design consideration because they can increase query processing time. Therefore, in our project, we will give users the option to enable reranking for improved accuracy or disable it for better performance.
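One way to sketch fusion retrieval with LangChain is to combine a BM25 keyword retriever with the Chroma vector retriever in an ensemble. This assumes the `rank_bm25` package is installed and reuses `chunks`, `vector_store`, and `rewritten_query` from the earlier snippets; the weights and `k` values are illustrative.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retriever over the same chunks stored in the vector database
keyword_retriever = BM25Retriever.from_documents(chunks)
keyword_retriever.k = 5

# Similarity-based retriever backed by the Chroma vector store
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Fuse the two result lists, weighting both retrievers equally (illustrative weights)
fusion_retriever = EnsembleRetriever(
    retrievers=[keyword_retriever, vector_retriever],
    weights=[0.5, 0.5],
)

relevant_chunks = fusion_retriever.invoke(rewritten_query)
```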
How to Process the Enriched Queries
The final step is to define the LLM that we will use for this project. Fortunately, we have considerable flexibility here thanks to the role the LLM plays in a RAG system. In a well-designed RAG system, the LLM should be able to access all necessary information to generate correct answers.
When selecting an LLM, the first decision is whether to use an open-source model or to access a paid API model. To ensure everyone can follow along with our series without incurring extra costs, we will opt for a free model.
Next, we need to determine the size of the model. Models generally fall into two categories: small and large. Small models have single-digit billions of parameters, such as the Llama 3 8B model and the Phi-3 models. While these models may perform slightly worse than larger ones, they are faster. Conversely, large models, like Llama 3 70B and Falcon 180B, are significantly bigger and perform better; however, they require substantial resources to run efficiently.
For our demonstration, we will use one of the smaller models, specifically the Llama 3 8B. Despite its smaller size, it is a highly advanced and recent model. This will allow us to achieve good results, especially when integrated into our RAG system.
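To close the loop, the enriched query could be sent to the model roughly as follows, again assuming Llama 3 8B is served locally through Ollama and reusing `relevant_chunks` from the retrieval sketch; the prompt text, model tag, and example question are assumptions for illustration.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOllama

answer_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say that you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

llm = ChatOllama(model="llama3:8b", temperature=0)  # assumed local model tag

# Pack the retrieved chunks into the prompt to form the enriched query
context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
enriched_query = answer_prompt.format_messages(
    context=context,
    question="How do RAG systems reduce hallucinations?",  # example question
)

answer = llm.invoke(enriched_query)
print(answer.content)
```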
In this article, we provided an overview of the RAG system we will build in upcoming articles. Our plan begins with LangChain for text extraction and chunking from diverse file types. This is followed by creating embeddings with the Nomic embedding model and storing them in a Chroma vector database. When a user submits a query, a fusion retrieval algorithm will leverage both the vectors stored in Chroma and a keyword search algorithm to identify and return relevant text chunks. Finally, we will send the enriched query to a Llama 3 8B model, and its response will be the final output of our RAG system.