Retrieval Augmented Generation — IN & OUT

This article covers in detail how to build a RAG pipeline with bare models from Sentence Transformers and how to improve retrieval efficiency by using a cross-encoder model.

Introduction

Retrieval Augmented Generation (RAG) is a powerful approach to enhance the capabilities of Large Language Models (LLMs) by providing them with additional context beyond their training data. This technique is particularly valuable when a company deals with vast amounts of unstructured data, such as internal documents containing product pricing, product descriptions, and legal terms and conditions. RAG enables LLMs to better understand and respond to user queries related to these internal documents, making it a valuable tool for various teams within the organization. Such a solution provides quick and accurate information, ultimately enhancing employee productivity.

In this article, we will explore the key components of a RAG pipeline and walk through the code implementation in Python.

Note: You’re welcome to skip certain sections in this article, as they cover additional elements of the codebase that are slightly beyond the scope of our discussion.

Key Components of a RAG pipeline

  1. Document Parser: Extracts text from unstructured documents
  2. Post Processor: Converts the extracted text to suit the requirements
  3. Embedding: Converts the extracted text into vectors
  4. Indexing: Builds index for the vectors
  5. Retrieval: Finds top similarity chunks in response to user queries
  6. Re-ranker: Re-adjusts the ranking of retrieved documents using Cross Encoder
  7. LLM Invocation: Utilizes the LLM API to generate responses
  8. Frontend: Utilizes Gradio to provide a chat interface for users

Document Parser

The first crucial element in building this pipeline is choosing a PDF of suitable size, or a sufficient number of PDF documents, to serve as the dataset against which we will pose specific questions.

For ease of use, I have considered Long-Walk-To-Freedom-Autobiography-of-Nelson-Mandela, which is 500 pages long and just about caters to our requirement.

The next step is to parse this document, extract the text, and store it in a structured form as a dataframe. This dataframe contains two columns, Page Number and Content. For this, we will leverage the unstructured library in Python. Below is the code snippet for this operation.

Extracting pdf contents with unstructured library

In the above code, partition_pdf is the method within unstructured that extracts the text. The elements object is then converted to a dictionary using the built-in convert_to_dict method. We then use the custom post_process method to extract just the page number and content into a dictionary. Finally, we convert this dictionary to a dataframe and store it under the columns page_number and content.
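
The original post shows this step as a screenshot; a minimal sketch of the flow, assuming a hypothetical file name and the post_process helper described next, might look like this:

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import convert_to_dict
import pandas as pd

# Parse the PDF and turn the extracted elements into plain dictionaries
elements = partition_pdf(filename="long-walk-to-freedom.pdf")  # hypothetical path
element_dicts = convert_to_dict(elements)

# post_process (sketched below) keeps only the page number and content
pages = post_process(element_dicts)

df = pd.DataFrame(pages, columns=["page_number", "content"])
df.to_csv("parsed_pdf.csv", index=False)  # saved for the downstream steps
```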

Processes the elements object retrieved from unstructured partition_pdf method
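
A sketch of what such a post_process helper could look like, assuming each element dictionary exposes its text and a metadata["page_number"] field:

```python
from collections import defaultdict

def post_process(element_dicts):
    """Collect the text of every element into one string per page."""
    page_contents = defaultdict(str)
    for element in element_dicts:
        page_number = element.get("metadata", {}).get("page_number")
        text = element.get("text", "")
        if page_number is not None and text:
            page_contents[page_number] += text + " "
    # Return one record per page, ready to be loaded into a dataframe
    return [
        {"page_number": page, "content": content.strip()}
        for page, content in sorted(page_contents.items())
    ]
```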

Below is additional code for PDF documents with images (feel free to skip this content and move to the next section).

For PDF documents that include images, I have introduced a flag which is responsible for triggering the respective code. Below is the code snippet.

Extracting text from PDF with Fitz library

The above code leverages the fitz library (PyMuPDF) to follow the steps below (a sketch of such a routine follows the list):

• Opens the PDF file
• Iterates through each page
• Extracts the text with the page's get_text() method
• Extracts images (if any) using the page's get_images() method
• Temporarily saves each image to a local directory
• Passes the image as a parameter to the custom-defined extract_text_from_image method
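
Putting these steps together, a minimal sketch of such a routine (file paths and names other than those mentioned in the text are assumptions; pix.save is writePNG on older PyMuPDF versions):

```python
import os
import fitz  # PyMuPDF

def parse_pdf_with_images(pdf_path, image_dir="tmp_images"):
    os.makedirs(image_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    pages = []
    for page_number, page in enumerate(doc, start=1):
        # Plain text on the page
        content = page.get_text()
        # Extract embedded images, save them temporarily, and OCR them
        for img_index, img in enumerate(page.get_images(full=True)):
            pix = fitz.Pixmap(doc, img[0])
            if pix.n >= 5:  # convert CMYK/alpha pixmaps to RGB before saving
                pix = fitz.Pixmap(fitz.csRGB, pix)
            image_path = os.path.join(image_dir, f"page{page_number}_{img_index}.png")
            pix.save(image_path)
            # OCR text is appended to the bottom of the page content
            content += "\n" + extract_text_from_image(image_path)
        pages.append({"page_number": page_number, "content": content})
    return pages
```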

Below is the snippet for the extract_text_from_image method, which leverages the OpenCV library (together with an OCR engine) to extract the text.

Opencv library in action to extract text from image

The OpenCV-based routine has to follow the usual protocol below to end up with the text extracted from the image (a sketch follows the list):

• Reads the image
• Converts the image to greyscale
• Applies thresholding (essentially done to separate the foreground from the background in the image)
• Defines the structuring element (rectangular, elliptical, etc.) and dilates the image for higher accuracy of text detection in the image
• Fetches the contours, which give us the coordinates of the detected objects
• Iterates over the contours, crops each bounding rectangle, and runs OCR on the cropped region to extract the text
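
A sketch of such an extract_text_from_image routine; here I assume Tesseract (via pytesseract) as the OCR engine, which is not named in the original text:

```python
import cv2
import pytesseract  # assumption: Tesseract as the OCR engine

def extract_text_from_image(image_path):
    # Read the image and convert it to greyscale
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Threshold to separate the foreground text from the background
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # A rectangular structuring element plus dilation merges nearby characters
    # into detectable text blocks
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18, 18))
    dilated = cv2.dilate(thresh, kernel, iterations=1)
    # Contours give the coordinates of each detected block
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    extracted = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        cropped = image[y:y + h, x:x + w]
        extracted.append(pytesseract.image_to_string(cropped))
    return "\n".join(extracted)
```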

Now we have the image text, which is appended to the content extracted from the same page. The extracted content is then saved as a CSV file for the downstream processes.

Note: All the text identified from the image is always appended to the bottom of that specific page content

Post Processor

This code module mainly deals with an additional layer of post-processing, over and above the processing that we have already implemented.

This involves applying techniques to enhance the efficiency of the retrieval process in our RAG pipeline. In this instance, there was a minor inefficiency because the autobiography was written in the first person. Consequently, when queries about Nelson Mandela were made, the model had difficulty understanding the context from the pages that referenced him. To address this, I replaced the first-person terms with “Nelson Mandela” to improve both the quality of retrieval and the accuracy of the responses.

Below is the code snippet:

PostProcessor class methods which help in post-processing the text extracted from the PDF

The main method leverages the preprocess and preprocess_helper methods to replace pronouns like “I, me, my, mine, myself” with “Nelson Mandela”. You can add more pronouns to further enhance the efficiency, but in my observations this workaround has already yielded better responses.
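
A minimal sketch of such a PostProcessor; the class and method names come from the text above, while the bodies (a simple word-boundary regex) are my assumption:

```python
import re

class PostProcessor:
    PRONOUNS = ["I", "me", "my", "mine", "myself"]  # extend as needed

    def preprocess_helper(self, text):
        # Replace whole-word, case-insensitive pronoun matches with the name
        pattern = r"\b(" + "|".join(self.PRONOUNS) + r")\b"
        return re.sub(pattern, "Nelson Mandela", text, flags=re.IGNORECASE)

    def preprocess(self, df):
        # Apply the replacement to every page's content column
        df["content"] = df["content"].apply(self.preprocess_helper)
        return df
```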

Embedding & Indexing

Now that we have the structured information extracted from the PDF into a dataframe, the next step is selecting an embedding model that suits our requirements. For this, sbert.net provides an overview of all the pre-trained models, which have been extensively evaluated across different tasks. I have chosen the multi-qa-MiniLM-L6-cos-v1 model for its semantic search performance. However, you can experiment with other models listed on the site to test their efficiency.

Before proceeding to create the embeddings, it’s necessary to convert each row of the dataframe into a string. This results in the content of each page being a string within a list of strings. Below is the code snippet for this operation.

Converts the dataframe to a list of string values combining both the page number and content columns
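
A sketch of what this conversion might look like (the exact string format is an assumption):

```python
def get_content_index(df):
    # One string per page, combining the page number and its content
    return [
        f"page_number: {row.page_number} content: {row.content}"
        for row in df.itertuples(index=False)
    ]
```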

We can now proceed to generate the embeddings for the list we obtained. Once these embeddings are created, we use the FAISS library to store them in a vector store. Below is the code snippet which does this operation.

Creates an index for the embeddings and stores it in a local vector store using FAISS

The get_content_index method converts the dataframe to a list of string values. The generate_embeddings method (see below) generates the embeddings using the model that we initialized in the init method.

Embeddings generator based on the model initialized in the init function

Initializing the sentence transformers model
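
Tying these pieces together, a minimal sketch under the assumption of a flat inner-product FAISS index (the class name and index type are not specified in the original):

```python
import faiss
from sentence_transformers import SentenceTransformer

class Embedder:  # hypothetical class name
    def __init__(self, model_name="multi-qa-MiniLM-L6-cos-v1"):
        # Sentence Transformers model chosen for semantic search
        self.model = SentenceTransformer(model_name)

    def generate_embeddings(self, texts):
        # Returns an (n, 384) float32 array of embeddings
        return self.model.encode(texts, convert_to_numpy=True).astype("float32")

    def create_index(self, df, index_path="content.index"):
        contents = get_content_index(df)   # page strings from the previous step
        embeddings = self.generate_embeddings(contents)
        faiss.normalize_L2(embeddings)     # cosine similarity via inner product
        index = faiss.IndexFlatIP(embeddings.shape[1])
        index.add(embeddings)
        faiss.write_index(index, index_path)  # persist the local vector store
        return index
```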

Now, we have the vector store for the PDF file we have finalized, which will be used by the model to retrieve appropriate context and respond to the user queries.

Retrieval & Re-ranking

In the retrieval methodology, we typically follow the steps below:

Generate embeddings for the user query

We use the same generate_embeddings method from the code above to generate embeddings for the user query.

Search the index for the generated embeddings

Below is the code snippet to search the index

Read the saved index file and search using the query embedding

We now have the similarity scores and the index IDs for the query embedding. Remember, these will have scores for all the pages in the PDF.
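
A sketch of such a search_index step, reusing the generate_embeddings method and the index saved earlier (file names are assumptions):

```python
import faiss

def search_index(query, embedder, index_path="content.index", init_k=10):
    # Embed the user query with the same model used for the documents
    query_embedding = embedder.generate_embeddings([query])
    faiss.normalize_L2(query_embedding)
    # Load the persisted index and search for the nearest pages
    index = faiss.read_index(index_path)
    scores, ids = index.search(query_embedding, init_k)
    return scores[0], ids[0]
```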

Retrieve top k matches from the retrieved index

We will now use the code snippet below to retrieve the top k matches from the similarity scores

Retrieve top k matches from the similarity scores and index IDs

To extract the top k matches, we first read the dataframe saved during the document parsing step. Using the index IDs obtained from the search_index method, we then iterate over the dataframe to extract content in descending order of similarity scores. A seen_candidates set is used to track the candidates encountered during the iteration, ensuring that the content, along with the page number and index ID, is added to the output dictionary appropriately.

Note: There are two parameters that can get a bit confusing in this code. init_k is the parameter responsible for extracting the initial set of documents matching the user query. We then re-rank this set of documents and end up with the final_k matches. Hence, the final_k parameter is responsible for filtering down to the final set of top matches for the user query. We have set init_k to 10 and final_k to 3.
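
A minimal sketch of such a get_top_k step, assuming the dataframe row order matches the FAISS index IDs (the output field names are not shown in the original, so they are assumptions):

```python
import pandas as pd

def get_top_k(scores, ids, df_path="parsed_pdf.csv", init_k=10):
    # Dataframe saved during the document parsing step
    df = pd.read_csv(df_path)
    seen_candidates = set()
    output = {}
    # scores/ids come back ordered by descending similarity
    for rank, (idx, score) in enumerate(zip(ids[:init_k], scores[:init_k])):
        if idx < 0 or idx in seen_candidates:
            continue
        seen_candidates.add(idx)
        row = df.iloc[idx]
        output[rank] = {
            "index_id": int(idx),
            "page_number": int(row["page_number"]),
            "content": row["content"],
            "score": float(score),
        }
    return output
```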

Feel free to skip the content below and proceed to the next sub-section

I have divided the code into two sections to differentiate between a chunk and a page number. As our example has only one long PDF, we have considered each page to be a chunk. In real-world scenarios there may be multiple long PDF documents, in which case we have to choose a chunk size to better maintain the context and improve retrieval. In that case, we iterate over chunks instead of page numbers and maintain the appropriate information in the output dictionary.

Re-Ranking

While we could use the initial set of documents retrieved from searching the index as-is, we enhance response quality by employing a post-retrieval mechanism using cross-encoder models. These models refine the ranking of the initial set of documents against the user query by jointly encoding the query and each document as a pair, rather than comparing separately computed embeddings. This process re-ranks the initial set of documents to produce the final set of top k matches. Below is the code snippet for this operation.

Re-ranking the output dictionary obtained from the get_top_k method using the cross encoder model
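
A sketch of this re-ranking step; the specific cross-encoder checkpoint is my assumption, as the original does not name one:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, final_k=3,
           model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):  # assumed checkpoint
    cross_encoder = CrossEncoder(model_name)
    items = list(candidates.values())
    # Score each (query, document) pair jointly
    pairs = [(query, item["content"]) for item in items]
    scores = cross_encoder.predict(pairs)
    # Keep the final_k highest-scoring documents
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:final_k]]
```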

Finally, we need to post-process the output to provide context for the LLM. To do this, we simply convert these matches into a single, consolidated string. Below is the code snippet for the same.

Convert the output dictionary to a string by taking the top 3 matches
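
A sketch of this consolidation step (the exact formatting of the context string is an assumption):

```python
def build_context(top_matches):
    # Concatenate the top matches into one context string for the LLM prompt
    return "\n\n".join(
        f"[page {match['page_number']}] {match['content']}" for match in top_matches
    )
```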

Feel free to skip the content below and proceed to the next section

Now that we have the context as a string to pass to the LLM, we encounter another potential issue: the cost associated with using an LLM, which is influenced by the number of tokens in the prompt. To mitigate this cost, we need to reduce the number of tokens in the prompt. A potential solution is to use a model called LLMLingua, which applies a prompt compression technique. Below is the code snippet which does this operation.

compress_prompt method from the LLMLingua PromptCompressor class
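
A sketch of how the LLMLingua compressor can be wired in; the exact arguments may differ across LLMLingua versions, and the token budget is an assumption:

```python
from llmlingua import PromptCompressor

# Note: LLMLingua loads a small language model under the hood and runs best on an Nvidia GPU
compressor = PromptCompressor()

def compress_context(context_chunks, query, target_token=500):
    # context_chunks: list of retrieved passage strings to be compressed
    result = compressor.compress_prompt(
        context_chunks, question=query, target_token=target_token
    )
    return result["compressed_prompt"]
```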

The provided code snippet returns a compressed prompt that can be injected into the LLM prompt, thereby helping to reduce the number of tokens. Unfortunately, I wasn’t able to test its efficiency on my machine, as the model requires an Nvidia GPU. While it’s possible to run the model on a CPU, doing so would significantly increase the latency of the response. Hence, I have commented out the sections relevant to prompt compression for now.

LLM Invocation and Frontend

The final and simplest step of this pipeline is to call the LLM of our choice to generate a response to the user query based on the retrieved context. Below is the code snippet to call the Mistral AI API.

Prompt and roles defined for the LLM to respond to the user query
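
A sketch of the LLM call; the client shape below follows the current mistralai Python SDK and may differ across versions, and the model name and prompt wording are assumptions:

```python
import os
from mistralai import Mistral

def call_llm(query, context, model="mistral-small-latest"):  # assumed model name
    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    messages = [
        {"role": "system",
         "content": "Answer the user's question using only the provided context."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    response = client.chat.complete(model=model, messages=messages)
    return response.choices[0].message.content
```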

Frontend Chat Interface

The final step is to provide an interface for the user to interact with the backend LLM. For this purpose, we chose Gradio, which simplifies the frontend code to just one line, providing an interactive interface for the user. For additional frontend functionality, you can refer to the Gradio docs. Below is the code snippet.

run_app is the driver function for the whole backend logic

The ChatInterface class provides an interactive way of chatting with the LLM through the gradio library. It takes a function (fn) parameter, which is called with the user input to drive the backend functionality. Here, run_app is the method executed: it retrieves the context based on the user input and calls the LLM for the final response.
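
A sketch of how run_app and the Gradio one-liner tie the hypothetical helpers sketched above together:

```python
import gradio as gr

embedder = Embedder()  # hypothetical class from the embedding section

def run_app(message, history):
    # Retrieve and re-rank context for the user message, then ask the LLM
    scores, ids = search_index(message, embedder)
    candidates = get_top_k(scores, ids)
    top_matches = rerank(message, candidates)
    context = build_context(top_matches)
    return call_llm(message, context)

# ChatInterface calls run_app with each new user message and the chat history
gr.ChatInterface(fn=run_app).launch()
```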

Next Steps

Research in this field has been evolving at a very fast pace, with companies trying to improve the retrieval mechanism. Microsoft has come up with Azure AI Search, which employs keyword as well as vector search to improve retrieval efficiency. It also provides citations for the top k matches of the user query from which it formulated its response. Moreover, the vector search backend builds an HNSW graph structure so it can traverse the nodes quickly and identify the nodes most similar to the user query.

The LlamaIndex framework has also employed a novel technique called RAG Fusion, which forms multiple versions of the user query in the first step and then extracts the top k matches for each version. Finally, it fuses all these documents to end up with the final set as the context for the LLM.

GitHub Codebase

The complete codebase

References

1. Sentence Transformers
2. Cross Encoders
3. FAISS Indexing
4. Gradio Docs
5. Mistral AI Docs
6. Unstructured Library
7. OpenCV text extraction
8. LLM Lingua