This article covers, in detail, how to build a RAG pipeline directly on top of pre-trained models from the Sentence Transformers library, and how to improve retrieval efficiency by using a cross-encoder model.
Retrieval Augmented Generation (RAG) is a powerful approach to enhance the capabilities of Large Language Models (LLMs) by providing them with additional context beyond their training data. This technique is particularly valuable when a company deals with vast amounts of unstructured data, such as internal documents containing product pricing, product descriptions, and legal terms and conditions. RAG enables LLMs to better understand and respond to user queries about these internal documents, making it a valuable tool for various teams within the organization. Such a solution provides quick and accurate information, ultimately enhancing employee productivity.
In this article, we will explore the key components of a RAG pipeline and walk through its code implementation in Python.
Note: You’re welcome to skip certain sections in this article, as they include additional elements in the codebase that are slightly beyond the scope of our discussion.
The crucial first step in building this pipeline is choosing a PDF of suitable size, or a sufficient number of PDF documents, to serve as the dataset against which we will pose specific questions.
For ease of use, I have chosen Long-Walk-To-Freedom-Autobiography-of-Nelson-Mandela, which is 500 pages long and just about caters to our requirement.
The next step is to parse this document, extract the text, and store it in a structured form as a dataframe with two columns, page number and content. For this, we will leverage the unstructured library in Python. Below is the code snippet for this operation.
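The original snippet is not reproduced here, so the following is a minimal sketch of the parsing step as described below; the post_process helper and the PDF file name are my own reconstructions.

```python
import pandas as pd
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import convert_to_dict

def post_process(element_dicts):
    # Group the text of every extracted element under its page number
    pages = {}
    for el in element_dicts:
        page_number = el["metadata"].get("page_number")
        pages.setdefault(page_number, []).append(el.get("text", ""))
    return {page: " ".join(texts) for page, texts in pages.items()}

# Extract the elements, convert them to dictionaries, and post-process them
elements = partition_pdf(filename="long-walk-to-freedom.pdf")
pages = post_process(convert_to_dict(elements))

# Store the result as a dataframe with the columns page_number and content
df = pd.DataFrame(
    {"page_number": list(pages.keys()), "content": list(pages.values())}
)
```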
In the above code, partition_pdf is the method within unstructured that extracts the text. The elements object is then converted to a dictionary using the built-in convert_to_dict method. We then use the custom post_process method to extract just the page number and content into a dictionary. Finally, we convert this dictionary into a dataframe with the columns page_number and content.
Below is additional code for PDF documents with images (feel free to skip this content and move to the next section).
For PDF documents that include images, I have introduced a flag that triggers the respective code. Below is the code snippet.
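Again, this is a sketch reconstructed from the description that follows; the flag name, helper name, and page-dictionary structure are assumptions, and extract_text_from_image is sketched in the next snippet.

```python
import fitz  # PyMuPDF

HAS_IMAGES = True  # flag that triggers the image-handling path

def append_image_text(pdf_path, pages):
    """Append OCR text from embedded images to the matching page's content."""
    if not HAS_IMAGES:
        return pages
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for image_info in page.get_images(full=True):
            xref = image_info[0]                           # reference id of the image
            image_bytes = doc.extract_image(xref)["image"]
            ocr_text = extract_text_from_image(image_bytes)
            if ocr_text.strip():
                # Image text always goes to the bottom of that page's content
                pages[page_number] = pages.get(page_number, "") + "\n" + ocr_text
    return pages
```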
The above code leverages the fitz (PyMuPDF) library to follow these steps: open the PDF, iterate over its pages, extract the embedded images from each page, run OCR on every image via extract_text_from_image, and append the recovered text to that page’s content.
Below is the snippet for the extract_text_from_image method, which leverages the OpenCV library to apply OCR techniques and extract the text.
OpenCV follows the usual preprocessing protocol, decoding the image, converting it to grayscale, and thresholding it, before the text is finally extracted from the image, as shown below.
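Since OpenCV itself does not ship an OCR engine, the sketch below assumes pytesseract performs the actual character recognition while OpenCV handles the preprocessing.

```python
import cv2
import numpy as np
import pytesseract

def extract_text_from_image(image_bytes):
    # Decode the raw image bytes into an OpenCV image
    image = cv2.imdecode(np.frombuffer(image_bytes, dtype=np.uint8), cv2.IMREAD_COLOR)
    # Convert to grayscale and binarize so the text stands out
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Run OCR on the cleaned-up image
    return pytesseract.image_to_string(thresh)
```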
We now have the image text appended to the content extracted from the same page. The combined content is then saved as a CSV file for the downstream processes.
Note: All the text identified from the images is always appended to the bottom of that specific page’s content.
This code module adds a layer of post-processing over and above the processing that we have already implemented.
This involves applying techniques to enhance the efficiency of the retrieval process in our RAG pipeline. In this instance, there was a minor inefficiency because the autobiography was written in the first person. Consequently, when queries about Nelson Mandela were made, the model had difficulty understanding the context from the pages that referenced him. To address this, I replaced the first-person terms with “Nelson Mandela” to improve both the quality of retrieval and the accuracy of the responses.
Below is the code snippet:
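The original preprocess and preprocess_helper implementation is not shown here, so the following is an assumed reconstruction of the pronoun-replacement step described below.

```python
import re

PRONOUNS = {"i", "me", "my", "mine", "myself"}

def preprocess_helper(match):
    # Replace a single first-person token with the subject's name
    token = match.group(0)
    return "Nelson Mandela" if token.lower() in PRONOUNS else token

def preprocess(text):
    # Word boundaries keep punctuation and the surrounding words intact
    return re.sub(r"\b[A-Za-z]+\b", preprocess_helper, text)

# Apply the replacement to every page of the parsed dataframe
df["content"] = df["content"].apply(preprocess)
```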
The main method leverages the preprocess and preprocess_helper methods to replace pronouns like “I, me, my, mine, myself” with “Nelson Mandela”. You can add more pronouns to further enhance the retrieval quality, but in my observations this workaround alone has already yielded better responses.
Now that we have the structured information extracted from the PDF into a dataframe, the next step is selecting an embedding model that suits our requirements. For this, sbert.net provides an overview of all the pre-trained models, which have been extensively evaluated across different tasks. I have chosen the multi-qa-MiniLM-L6-cos-v1 model for its semantic search performance. However, you can experiment with other models listed on the site to test their efficiency.
Before proceeding to create the embeddings, it’s necessary to convert each row of the dataframe into a string. This results in the content of each page being a string within a list of strings. Below is the code snippet for this operation.
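A minimal sketch of that helper is shown below; the function name matches the get_content_index method referenced later, and the content column name comes from the parsing step.

```python
def get_content_index(df):
    # One string per page, in page order
    return df["content"].astype(str).tolist()

contents = get_content_index(df)
```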
We can now proceed to generate the embeddings for the list we obtained. Once these embeddings are created, we use the FAISS library to store them in a vector store. Below is the code snippet which does this operation.
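The sketch below assumes a small wrapper class that holds the Sentence Transformers model and builds a FAISS index over the page embeddings; the class name and the pages.index file name are illustrative.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class Retriever:
    def __init__(self):
        # Embedding model chosen for its semantic-search performance
        self.model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

    def generate_embeddings(self, texts):
        # Returns a (num_texts, embedding_dim) float32 array
        return np.asarray(self.model.encode(texts), dtype="float32")

    def build_index(self, contents):
        embeddings = self.generate_embeddings(contents)
        faiss.normalize_L2(embeddings)           # so inner product == cosine similarity
        index = faiss.IndexFlatIP(embeddings.shape[1])
        index.add(embeddings)
        faiss.write_index(index, "pages.index")  # persist for the retrieval step
        return index

retriever = Retriever()
index = retriever.build_index(contents)
```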
The get_content_index method converts the dataframe into a list of string values, and the generate_embeddings method (shown in the snippet above) generates the embeddings using the model that we initialized in the init method.
We now have the vector store for our chosen PDF, which the model will use to retrieve the appropriate context and respond to user queries.
In the retrieval methodology we typically follow these steps:
- Generate embeddings for the user query, using the same generate_embeddings method from the code above.
- Search the FAISS index with the query embedding to obtain similarity scores and index IDs.
- Retrieve the top init_k matching pages from the dataframe.
- Re-rank those candidates with a cross-encoder to arrive at the final_k matches.
Below is the code snippet to search the index
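The following is a sketch of that search, reusing the Retriever wrapper and the persisted index from the earlier snippets; the function name follows the search_index method referenced later.

```python
import faiss

def search_index(retriever, query):
    index = faiss.read_index("pages.index")
    query_embedding = retriever.generate_embeddings([query])
    faiss.normalize_L2(query_embedding)
    # Score the query against every page, so we get similarity scores and
    # index IDs for the whole document, ordered from most to least similar
    scores, ids = index.search(query_embedding, index.ntotal)
    return scores[0], ids[0]
```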
We now have the similarity scores and the index IDs for the query embedding. Remember, these include scores for all the pages in the PDF.
We will now use the code snippet below to retrieve the top k matches from the similarity scores.
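This is a reconstruction of the extraction logic described below; the CSV file name is illustrative, and each page acts as one chunk in this single-PDF example.

```python
import pandas as pd

def get_top_matches(scores, ids, init_k=10):
    df = pd.read_csv("parsed_pages.csv")    # saved during the document parsing step
    top_matches = {}
    seen_candidates = set()
    for score, idx in zip(scores, ids):     # already in descending order of similarity
        if len(top_matches) == init_k:
            break
        page_number = int(df.iloc[idx]["page_number"])
        if page_number in seen_candidates:
            continue
        seen_candidates.add(page_number)
        top_matches[int(idx)] = {
            "page_number": page_number,
            "content": df.iloc[idx]["content"],
            "score": float(score),
        }
    return top_matches
```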
To extract the top k matches, we first read the dataframe saved during the document parsing step. Using the index IDs obtained from the search_index method, we then iterate over the dataframe to extract content in descending order of similarity scores. A seen_candidates set is used to track the candidates encountered during the iteration, ensuring that the content, along with the page number and index ID, is added to the output dictionary appropriately
Note: Two parameters in this code can get a bit confusing. init_k is responsible for extracting the initial set of documents matching the user query. We then re-rank this set of documents to end up with the final_k matches. Hence, the final_k parameter is responsible for filtering down to the final set of top matches for the user query. We have set init_k to 10 and final_k to 3.
Feel free to skip content below and proceed to the next sub-section
I have divided the code into two sections to differentiate between a chunk and a page number. As our example has only one long PDF, we treat each page as a chunk. In real-world scenarios there may be multiple long PDF documents, in which case we have to choose a chunk size to better maintain context and improve retrieval. In that case, we iterate over chunks instead of page numbers and maintain the appropriate information in the output dictionary.
While we could pass along the initial set of documents retrieved from searching the index as-is, we improve response quality by employing a post-retrieval re-ranking step with a cross-encoder model. Unlike the bi-encoder used for retrieval, a cross-encoder scores each query–document pair jointly, giving a sharper estimate of how well each document answers the query. This process re-ranks the initial set of documents to produce the final set of top k matches. Below is the code snippet for this operation.
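A sketch of that re-ranking step is shown below; the specific cross-encoder checkpoint is an assumption, chosen because it is a commonly used re-ranker from the Sentence Transformers model zoo.

```python
from sentence_transformers import CrossEncoder

def rerank(query, top_matches, final_k=3):
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    candidates = list(top_matches.values())
    # The cross-encoder scores each (query, document) pair jointly
    pairs = [(query, candidate["content"]) for candidate in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:final_k]]
```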
Finally, we need to post-process the output to provide context for the LLM. To do this, we simply convert these matches into a single, consolidated string. Below is the code snippet for the same.
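A minimal version of that consolidation might look like the following, keeping the page numbers alongside the content so the answer can reference them.

```python
def build_context(final_matches):
    parts = [
        f"[Page {match['page_number']}]\n{match['content']}"
        for match in final_matches
    ]
    return "\n\n".join(parts)
```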
Feel free to skip the content below and proceed to the next section
Now that we have the context as a string to pass to the LLM, we encounter another potential issue: the cost of using an LLM, which is driven by the number of tokens in the prompt. To mitigate this cost, we need to reduce the number of tokens in the prompt. A potential solution is LLMLingua, which applies a prompt compression technique. Below is the code snippet which does this operation.
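The sketch below reflects my reading of the llmlingua package’s PromptCompressor API, so treat the exact arguments as assumptions; the target_token budget is an illustrative value, and the default compressor loads a LLaMA-based model that expects a GPU.

```python
from llmlingua import PromptCompressor

def compress_context(context, query, target_token=500):
    compressor = PromptCompressor()  # downloads the default compression model
    result = compressor.compress_prompt(
        context,
        question=query,
        target_token=target_token,
    )
    return result["compressed_prompt"]
```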
The provided code snippet returns a compressed prompt that can be injected into the LLM prompt, thereby helping to reduce the number of tokens. Unfortunately, I wasn’t able to test the efficiency on my machine as the model requires an Nvidia GPU. While it’s possible to use the model with a CPU, doing so would significantly increase the latency of the response. Hence, I commented out the sections relevant to prompt compression for now.
The final and simplest step of this pipeline is to call the LLM of our choice to generate a response to the user query based on the retrieved context. Below is the code snippet to call the Mistral AI API.
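The sketch below assumes the official mistralai Python client (version 1.x); the model name and prompt template are illustrative.

```python
import os
from mistralai import Mistral

def generate_answer(query, context):
    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.chat.complete(
        model="mistral-small-latest",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```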
The final step is to provide an interface for the user to interact with the backend LLM. For this purpose, we chose Gradio, which reduces the frontend code to essentially one line and provides an interactive interface for the user. For additional functionality around the frontend, you can refer to the Gradio docs. Below is the code snippet.
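The run_app wiring below is a sketch that ties together the helpers assumed in the earlier snippets (search_index, get_top_matches, rerank, build_context, generate_answer).

```python
import gradio as gr

def run_app(message, history):
    # Retrieve, re-rank, and consolidate the context, then call the LLM
    scores, ids = search_index(retriever, message)
    top_matches = get_top_matches(scores, ids, init_k=10)
    final_matches = rerank(message, top_matches, final_k=3)
    context = build_context(final_matches)
    return generate_answer(message, context)

gr.ChatInterface(fn=run_app).launch()
```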
The ChatInterface class provides an interactive way of chatting with the LLM through the gradio library. It takes a function (fn) parameter, which is called with the user input to drive the backend functionality. run_app is the method executed here: it retrieves the context based on the user input and calls the LLM for the final response.
Research in this field has been evolving at a very fast pace, with companies racing to improve the retrieval mechanism. Microsoft has come up with Azure AI Search, which combines keyword and vector search to improve retrieval efficiency. It also provides citations for the top k matches to the user query, from which the response is formulated. Moreover, its vector search builds an HNSW graph structure in the backend to traverse the nodes quickly and identify those most similar to the user query.
The LlamaIndex framework has also adopted a technique called RAG Fusion, which first generates multiple versions of the user query, then extracts the top k matches for each version, and finally fuses all these documents into the final set of context for the LLM.