Retrieval-Augmented Generation (RAG): Preprocessing Data for Vector Databases

A guide to preprocessing data for vector databases.
By Boris Delovski • Updated on Sep 19, 2024

The way Retrieval-Augmented Generation (RAG) systems work is straightforward: when a user writes a query, the system first searches a knowledge base for information relevant to that query. It then creates a new version of the original query that includes the retrieved information. This provides the LLM in our RAG system with essential context, enhancing its ability to answer the user's question. As a result, the overall performance of a RAG system depends heavily on its knowledge base: the more precise the information retrieval process, the better the final result.
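In code terms, the flow can be sketched roughly as follows. Note that search_knowledge_base() and call_llm() below are simple stand-ins for the real components, not actual library calls:

# A rough sketch of the RAG flow described above.
# search_knowledge_base() and call_llm() are stand-ins for the real components.
def search_knowledge_base(query):
    # In a real system, this would search the knowledge base
    return "Retrieved information relevant to the query."

def call_llm(prompt):
    # In a real system, this would call an actual LLM
    return f"Answer based on: {prompt}"

def answer_with_rag(query):
    relevant_info = search_knowledge_base(query)    # retrieval step
    augmented_query = f"Context:\n{relevant_info}\n\nQuestion: {query}"
    return call_llm(augmented_query)                # generation step

print(answer_with_rag("What is a vector database?"))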

We could use a standard database to store our information and treat it as our knowledge base. However, this is not ideal, for two reasons. First, we often need to store large quantities of data, so finding an efficient way to store it is crucial. Second, when a user writes a query, we need an effective method for identifying and retrieving the most relevant data from our knowledge base.

Both of these limitations indicate that using standard databases for storing text is not the right approach. Instead, we will build what we call a vector database. This database will store our text data in the form of embeddings, that is, as vectors. However, before we can create our vector database, we first need to preprocess our data, which is the focus of this article.

What Is a Vector Database?

While the focus of this article is data preprocessing, understanding what a vector database is helps explain why our data needs to be preprocessed in a specific manner. A vector database is a specialized type of database designed to store and manage data in vector form. These databases are optimized for handling high-dimensional data efficiently and are designed to scale to large datasets, often containing millions or even billions of vectors.

In addition, they are performance-optimized, often utilizing hardware acceleration such as GPUs for rapid information retrieval. The primary function of vector databases is to perform similarity searches. The system quickly identifies vectors that are most similar to a given query vector by calculating metrics such as Euclidean distance or cosine similarity.
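To make the similarity search concrete, here is a minimal sketch, independent of any particular vector database, of how cosine similarity can rank stored vectors against a query vector:

import numpy as np

def cosine_similarities(query_vec, stored_vecs):
    """Compute the cosine similarity between a query vector and each stored vector."""
    stored_vecs = np.asarray(stored_vecs)
    dot_products = stored_vecs @ query_vec
    norms = np.linalg.norm(stored_vecs, axis=1) * np.linalg.norm(query_vec)
    return dot_products / norms

# Toy example with three stored "embeddings" and one query embedding
stored = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]]
query = np.array([0.2, 0.8])
scores = cosine_similarities(query, stored)
print(scores.argsort()[::-1])  # indices of stored vectors, most similar first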

This functionality makes vector databases ideal for our needs, as they enable rapid searches through vast amounts of data to find information relevant to our queries. To store text in such a database, we first need to convert the text into vectors, known as embeddings. In essence, our entire RAG system works by converting a query into an embedding, calculating the similarity between this query embedding and the embeddings stored in our database, and returning the data most relevant to the query. This additional information enriches the original query, helping the LLM return more accurate answers and reducing the likelihood of hallucinations.

While there are many vector databases available, they all function in a highly similar way. As a result, the choice of vector database primarily depends on the framework you plan to use for implementing your LLM-based application. In our scenario, we will use Langchain and LlamaIndex to build our RAG system. Therefore, it is best to choose a database that integrates smoothly with these frameworks to reduce potential friction when combining different libraries. This helps us avoid unexpected or undesirable interactions between the database and the LLM, allowing us to concentrate on the core task of constructing the RAG system.

In this example, we will use Chroma as our database. However, we will delve into Chroma and the process of creating embeddings in the next article. In this article, we will focus on the essential data preprocessing steps. Specifically, we will cover building a data loader to import our data and developing a splitting algorithm to divide our documents into chunks. These chunks can later be encoded into embeddings.


How to Preprocess Text

The first step in preparing data is to ensure that we have an efficient method for loading it. Langchain offers various data loaders to handle this task. The choice of the loader depends only on the type of data we plan on working with. In this example, since we are dealing with PDF files, I will use the PyPDFDirectoryLoader. This loader takes a directory path as input and loads all PDF files found within that directory and all of its subdirectories.

After loading the data, the next step is to split it into chunks. As discussed in our previous article, generating a separate vector for every word in large documents is not practical. Our objective is to retrieve entire paragraphs relevant to our query, not individual words. Therefore, we will split the text into manageable chunks and create a unique vector for each chunk. There are various approaches to splitting text into chunks. In this example, we will use the recursive splitter from Langchain to split our loaded text into chunks. 

The recursive text splitter in Langchain is designed to break down large pieces of text into smaller, manageable chunks while preserving the semantic integrity of the content. This is particularly useful for processing large documents, ensuring that each chunk maintains its coherent meaning. The process is straightforward:

  • text is provided to the splitter
  • the text is first split into larger units like paragraphs or sections
  • each of these is checked against the size limit we define when creating the splitter
  • if any chunk is still too large, it will be further split into smaller units, until all chunks are of the desired size
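To see this recursion in action before we build the full pipeline, here is a small sketch using a deliberately tiny chunk_size so the splitting is visible (the import path assumes the langchain-text-splitters package is installed):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=0)

text = (
    "Vector databases store embeddings.\n\n"
    "They are optimized for fast similarity searches over millions of vectors."
)

# The first paragraph fits in one chunk; the second exceeds the size
# limit, so it is recursively split further on smaller separators
for chunk in splitter.split_text(text):
    print(repr(chunk))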

Let’s demonstrate how we can load and split documents using Langchain. 

How to Load Documents

As mentioned in the previous section, we will use the PyPDFDirectoryLoader from Langchain to load PDFs from a directory. This loader can read and load multiple PDF files from a specified directory at once. Once the PDF documents are loaded, they can be easily integrated into a Langchain pipeline, which we will build later.

from langchain_community.document_loaders import PyPDFDirectoryLoader

# Create loader and load documents
loader = PyPDFDirectoryLoader("data")
documents = loader.load()

print(f"Found {len(documents)} documents")

I have a directory named "data", which contains one PDF file and a subdirectory; inside that subdirectory, there is another PDF file. After loading both PDFs using PyPDFDirectoryLoader, I ended up with 23 documents, even though there are only two PDF files present. This behavior is expected: the loader splits the content of each PDF into one document per page, so my two PDFs collectively contain 23 pages. The exact granularity is not crucial here, as we will further split the documents into smaller chunks in the next step.
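You can confirm this page-per-document behavior by inspecting the metadata the loader attaches to each document (the file name in the sample output below is just a placeholder):

# Each loaded document records its source file and page number
print(documents[0].metadata)
# e.g. {'source': 'data/example.pdf', 'page': 0}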

One final note before we move on to splitting: it is generally a good idea to use the loader's lazy_load() method instead of the standard load(). Unlike load(), which returns all documents at once, lazy_load() returns an iterator, allowing documents to be processed one at a time. This reduces memory usage and improves efficiency, especially when dealing with large numbers of documents. We will revisit this when we start building a pipeline. For now, let's wrap the code above into a function, replacing load() with lazy_load().

from langchain_community.document_loaders import PyPDFDirectoryLoader

# Create function that loads documents
def load_documents(data_path):
    """
    Loads documents from the specified directory and its subdirectories.

    Args:
        data_path (str): Path to the directory containing the PDF files.

    Returns:
        generator: A generator yielding loaded documents.
    """

    return PyPDFDirectoryLoader(data_path).lazy_load()
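The function can then be consumed like any other iterator, for example:

# Documents are read one at a time as the loop advances
for document in load_documents("data"):
    print(document.metadata)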

How to Split Documents

As mentioned in the previous section, the recursive splitter from Langchain will be used to split documents. There are a few parameters we have to define when creating a recursive splitter:

  • chunk_size 
  • chunk_overlap
  • length_function
  • is_separator_regex

The chunk_size parameter determines the maximum size (in characters, by default) of each chunk the text will be split into. For example, if chunk_size is set to 100, the text splitter will attempt to create chunks of up to 100 characters. Langchain's default chunk size is 4000 characters; in our example, we will use 800.

The chunk_overlap parameter specifies the number of characters that should overlap between consecutive chunks. For instance, if chunk_overlap is set to 200, the last 200 characters of one chunk will be repeated at the beginning of the next chunk. Langchain's default value for this parameter is 200; in our example, we will use 80.

The length_function parameter allows you to specify a custom function to determine the "length" of the text when splitting. The default function is len, which measures the length in terms of characters.

Finally, the is_separator_regex parameter is a boolean parameter that determines whether the separator used for splitting is a regular expression (regex). If set to True, the separator is treated as a regex; if False, it's treated as a plain string.
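As an illustration of a custom length_function, the sketch below measures length in tokens rather than characters. It assumes the tiktoken package is installed; this is just one possible choice, not something our example requires:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def token_length(text):
    """Measure the length of a piece of text in tokens rather than characters."""
    return len(encoding.encode(text))

# This could then be passed to the splitter as length_function=token_length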

For our example, let’s use standard values and define the splitter as follows:

from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents):
    """
    Splits the provided documents into chunks using a recursive character splitter.

    Args:
        documents (Iterable[Document]): The documents to be split.

    Yields:
        Document: Chunks of the original documents after splitting.
    """

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=80,
        length_function=len,
        is_separator_regex=False,
    )

    for document in documents:
        for chunk in text_splitter.split_documents([document]):
            yield chunk

As you can see, I once again decided to create an iterator instead of returning values, similar to the lazy loading process we defined earlier. This approach allows the function to yield one chunk at a time, improving memory efficiency, especially when working with large datasets.
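Putting the two functions together, the whole preprocessing step becomes a short chain of generators. A minimal usage sketch:

# Chain the lazy loader and the lazy splitter together
documents = load_documents("data")
chunks = split_documents(documents)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk.page_content)} characters")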

This concludes the data preprocessing steps. The chunks returned by the split_documents() function are what we will convert into vectors and store in our vector database. We will cover this process in the following article.

This article provided an overview of vector databases and their role in Retrieval-Augmented Generation (RAG) systems, with a more in-depth exploration to follow in the next article of this series. Here, we focused on the preprocessing steps required to prepare our data for vector conversion and storage in a vector database. We used Langchain to create a data loader for importing documents from a directory and a splitter to divide those documents into manageable chunks. In the next article, we will explore how to convert these chunks into embeddings and build a vector database from them.

Boris Delovski

Data Science Trainer


Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.