arXiv RAG Project Part 2 | Building the Prototype

Bringing the Parts Together to Construct a Working Prototype
ai
machine learning
prototype
project
deep learning
nlp
rag
Author

Jack Tol

Published

May 26, 2024

Note

This blog post is available in audio format as well. You can listen using the player below or download the mp3 file for listening at your convenience.

Quick Tip!

This blog post is a continuation of the arXiv RAG project I have been working on and assumes prior knowledge of Language Models and RAG. If you haven’t already, please review Part 1 of this series and the introductory post on Language Models and RAG.

Important Note!

The code for the different functions of this prototype is explained in detail. The code blocks in these sections are hidden by default and can be viewed by expanding the “Show Code” folding subsection at the top of each section.

1.0 | Introduction

In the previous blog post, we covered a detailed introduction to the project, along with the code and explanations for the metadata downloading, processing, and uploading steps. That was the first big hurdle to clear and one of the main components that make a project like this possible at all. This post covers in detail all of the imports, the helper functions, and the main function, including both code snippets and explanations, which together make up a first working prototype of the application.

This prototype was developed in and is intended to be run in a Jupyter Notebook, as shown in the prototype demonstration video which can be found in Section 4.0 | Video Demonstration.

2.0 | Code Review

2.1 | Imports

Before getting into the details of the functions that bring this prototype to life, let’s first briefly cover the libraries we’ll be using. Understanding these libraries is important, because the abstraction and functionality they provide are what allow a project like this to be developed with relatively little code.

Below are the imports, followed by a brief overview of each library’s primary purpose:

import os
import arxiv

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_pinecone import PineconeVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAI

2.1.1 | Python Standard Library + Third Party Imports

  • os - Part of the Python Standard Library used to manage operating system-specific functionality. In this prototype, it is used to delete the locally downloaded PDF once it has been processed.
  • arxiv - Python wrapper for the arXiv API. In this prototype, it is used to remotely download the PDF of the user-selected paper.

2.1.2 | LangChain Specific Imports

  • HuggingFaceEmbeddings - LangChain’s integration with HuggingFace Embeddings. In this prototype, it is used to perform the embedding on the user query and the document chunks.
  • PyPDFLoader - LangChain’s provided PDF document loader. In this prototype, it is used to load in the remotely downloaded PDF.
  • SemanticChunker - LangChain’s integration with Greg Kamradt’s implementation of the Semantic Chunker. It is used in this prototype for PDF splitting and chunking.
  • PineconeVectorStore - LangChain’s integration with Pinecone, which allows for seamless uploading and retrieving from a Pinecone index.
  • ChatPromptTemplate - LangChain’s implementation of a prompt template, which allows for easy dynamic insertion of retrieved context and a user’s query.
  • StrOutputParser - LangChain’s Output Parser, which parses the result from the LLM into a plain string containing only the output. In this prototype, it is used to parse the output from the OpenAI inference model into the raw output string.
  • OpenAI - LangChain’s integration with OpenAI, which allows for the simple creation of OpenAI inference models and related tools.
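Note that the OpenAI and Pinecone integrations expect API keys to be available before any of the code below will run. As a minimal sketch, assuming the standard behaviour of the langchain_openai and langchain_pinecone packages of reading their keys from environment variables, the setup might look like this (the values shown are placeholders, not real keys):

# Assumed setup: both integrations read their API keys from these environment
# variables. The values below are placeholders, not real keys.
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["PINECONE_API_KEY"] = "..."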

2.2 | Global Variables

Let’s now briefly cover some of the global variables we define before getting started with the helper functions.

embedding_model = HuggingFaceEmbeddings()
metadata_vector_store = PineconeVectorStore.from_existing_index(embedding=embedding_model, index_name="arxiv-metadata")
chunks_vector_store = PineconeVectorStore.from_existing_index(embedding=embedding_model, index_name="arxiv-project-chunks")
semantic_chunker = SemanticChunker(embeddings=embedding_model, buffer_size=1, add_start_index=False)
  • embedding_model - Set to LangChain’s HuggingFaceEmbeddings() object. We use this embedding model whenever we embed document chunks for upload or embed a user query for the similarity search.
  • metadata_vector_store - Configured with the LangChain-Pinecone integration to simplify calling methods on the metadata vector store.
  • chunks_vector_store - Configured with the LangChain-Pinecone integration to simplify uploading document chunks and calling methods on the vector store.
  • semantic_chunker - Configured as the Semantic Chunker object with our embedding model, buffer size, and start index settings, making it easy to slot into our process for uploading new research papers.
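As a quick, hypothetical sanity check (not part of the prototype itself), we can confirm the metadata index is reachable by running a small similarity search against it; as covered in Part 1, each returned document’s page content holds the title_by_author string and its metadata holds the corresponding document_id:

# Hypothetical sanity check, not part of the prototype; the query string is just an example title.
sample = metadata_vector_store.similarity_search(query="attention is all you need", k=1)
if sample:
    print(sample[0].page_content)
    print(sample[0].metadata["document_id"])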

2.3 | Helper Functions

In this section, we cover the various helper functions that bring this prototype to life. For each one, the code block can be viewed by clicking the “Show Code” button, and a detailed explanation covers what the function does and why.

2.3.1 | Download, Process & Upload a New Research Paper

Show Code
def process_and_upload_chunks(document_id):
    print("Downloading Paper")
    # Fetch the paper via the arXiv API and download its PDF locally as document_id.pdf.
    paper = next(arxiv.Client().results(arxiv.Search(id_list=[str(document_id)])))
    paper.download_pdf(filename=f"{document_id}.pdf")
    loader = PyPDFLoader(f"{document_id}.pdf")
    print("Processing & Uploading Paper")
    pages = loader.load_and_split()

    # Semantically chunk each page and collect the resulting text chunks.
    chunks = []
    for page in pages:
        text = page.page_content
        chunks.extend(semantic_chunker.split_text(text))
    # Embed and upload the chunks, tagging each one with the paper's document_id.
    chunks_vector_store.from_texts(texts=chunks, embedding=embedding_model, metadatas=[{"document_id": document_id} for _ in chunks], index_name="arxiv-project-chunks")
    os.remove(f"{document_id}.pdf")
    print("Paper Uploaded. Please Proceed To Ask Your Question.")

The first consideration is the automatic downloading, loading, chunking, embedding, and uploading of a research paper.

We use a combination of the arXiv library for remote downloading of the research paper PDF, followed by LangChain to handle the subsequent steps.

The arXiv API wrapper’s documentation shows that it is very easy to remotely download a research paper from arXiv using its document_id. This is perfect because, as discussed in Part 1 of the series, every title_by_author vector stored within our metadata vector database has a corresponding document_id stored in its metadata.

With this in mind, we first create a function, process_and_upload_chunks, which takes in a document_id as an argument and uses this document_id to aid in the downloading, processing, and uploading stages. The function begins by printing "Downloading Paper" to the user, indicating the start of the downloading process. It then uses arxiv.Client() to search for the paper specified by the document_id, downloading its PDF with paper.download_pdf(filename=f"{document_id}.pdf"). When a paper is downloaded using this method, the name of the downloaded file is in the format of document_id.pdf. This is important to note, as the next step is to create a document loader object using LangChain’s PyPDFLoader document loader, which will be used to load the downloaded PDF.

We then print "Processing & Uploading Paper" to the user, indicating that the document download was successful and that the program is proceeding to process and upload the document. Using the PDF document loader object, we load and split the paper into its individual pages with loader.load_and_split(), setting the returned pages to the variable pages.

We then loop through each of these pages, making sure that the content of each page is passed through LangChain’s implementation of Greg Kamradt’s Semantic Chunker, appending the returned chunks to a chunks list using chunks.extend(semantic_chunker.split_text(text)). These chunks are then uploaded to the chunks vector store with chunks_vector_store.from_texts, embedding them using the embedding_model and associating them with metadata containing the document_id. After processing, the function deletes the PDF file with os.remove(f"{document_id}.pdf") and prints "Paper Uploaded. Please Proceed To Ask Your Question.", which communicates to the user the completion of this process.
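A minimal usage sketch of this function looks as follows; the arXiv ID used here is a placeholder example, not one taken from the post:

# Hypothetical usage; "1706.03762" is an example arXiv ID.
process_and_upload_chunks("1706.03762")
# Console output:
# Downloading Paper
# Processing & Uploading Paper
# Paper Uploaded. Please Proceed To Ask Your Question.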

2.3.2 | Checking if Chunks Already Exist

Show Code
def do_chunks_exist_already(document_id):
    filter = {"document_id": {"$eq": document_id}}
    test_query = chunks_vector_store.similarity_search(query="Chunks Existence Check", k=1, filter=filter)
    return bool(test_query)

Next, we define a function, do_chunks_exist_already, which checks whether chunks for a given document_id already exist in the chunks vector database. The function creates a filter on the document_id and performs a similarity search with chunks_vector_store.similarity_search(query="Chunks Existence Check", k=1, filter=filter). The result is converted to a boolean: True if the returned list is non-empty, meaning chunks for the specified document_id are already present, and False if it is empty. This check lets us manage the flow of logic and decide which additional functions to call depending on the state of the application.
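In practice, this check is meant to guard the upload step, exactly as the main function does later on. A minimal sketch, again using a placeholder document ID:

# Hypothetical usage: only download and process the paper if its chunks are not already stored.
document_id = "1706.03762"
if not do_chunks_exist_already(document_id):
    process_and_upload_chunks(document_id)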

2.3.3 | Processing a User Query

Show Code
def process_user_query(document_id):
    context = []
    user_query = input("Please enter your question:\n")
    filter = {"document_id": {"$eq": document_id}}
    search_results = chunks_vector_store.similarity_search(query=user_query, k=10, filter=filter)
    for doc in search_results:
        context.append(doc.page_content)
    return context, user_query

The next step is to define the process_user_query function, which processes a user’s query against the chunks stored for a given document_id. The function starts by initializing an empty context list and prompts the user to enter their question with user_query = input("Please enter your question:\n"). It then creates a filter using the document_id and performs a similarity search with chunks_vector_store.similarity_search(query=user_query, k=10, filter=filter). The search results are looped through, appending each document’s page content to the context list. The function then returns both the context and the user_query.
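Because the function calls input(), it pauses for the user’s question before running the retrieval. A hypothetical run, again with a placeholder document ID, might look like this:

# Hypothetical usage; the document ID is a placeholder.
context, user_query = process_user_query("1706.03762")
print(f"Retrieved {len(context)} chunks for the question: {user_query}")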

2.3.4 | Querying with OpenAI

Show Code
def query_openai_with_context(context, user_query):
    template = """Use The Following Context:
    Context: {context}
    To Answer The Following Question:
    {user_query}
    """
    prompt = ChatPromptTemplate.from_template(template)
    model = OpenAI()
    parser = StrOutputParser()
    chain = prompt | model | parser
    output = chain.invoke({"context": context, "user_query": user_query})
    return output

Next, we define the query_openai_with_context function, which uses OpenAI to generate a response based on the provided context and user query. The function defines a template string that instructs the model to use the given context to answer the user’s question. The prompt is created using LangChain’s ChatPromptTemplate.from_template(template). An instance of the OpenAI model is then initialized, and a string output parser (StrOutputParser) is set up. These components are linked together in a chain (prompt | model | parser) using LangChain Expression Language (LCEL). The chain is then invoked with the context and user query, and the generated output is returned.
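To make the data flow a little more concrete, here is a hypothetical illustration of how the template’s placeholders get filled in before the prompt reaches the model; the context and question values are stand-ins for real retrieval results:

# Hypothetical illustration; the values below stand in for real retrieved chunks and a real question.
example_prompt = ChatPromptTemplate.from_template(
    "Use The Following Context:\nContext: {context}\nTo Answer The Following Question:\n{user_query}"
)
print(example_prompt.format(context="<retrieved chunks>", user_query="<user question>"))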

2.3.5 | Selecting a Research Paper From the Retrieved Results

Show Code
def select_document_from_results(search_results):
    if not search_results:
        print("No search results found.")
        return None
    print("Top search results based on content and metadata:\n")
    for i, doc in enumerate(search_results, start=1):
        page_content = doc.page_content
        document_id = doc.metadata['document_id']
        print(f"{i}: Research Paper Title & Author: {page_content}\n   Document ID: {document_id}\n")
    user_choice = int(input("Select a paper by entering its number: ")) - 1
    if 0 <= user_choice < len(search_results):
        selected_doc_id = search_results[user_choice].metadata['document_id']
        print(f"\nYou selected document ID: {selected_doc_id}")
        return selected_doc_id
    else:
        print("\nInvalid selection. Please run the process again and select a valid number.")
        return None

For the last helper function, we define the select_document_from_results function, which is responsible for selecting a research paper from a list of papers retrieved from the metadata vector store based on the paper title entered by the user. The function starts by checking if the search_results list is empty, and if so, it prints a message “No search results found.” and returns None, indicating that no papers were found. If there are results, it prints a message “Top search results based on content and metadata:”, followed by enumerating each result, showing the Research Paper Title & Author which is stored in page_content, and the document_id. The user is then prompted to select a paper by entering its number.

The function checks whether the entered number corresponds to a valid index in the search_results list. If the selection is valid, it retrieves and prints the document_id of the selected paper and returns it. If the selection is invalid, it prints a message asking the user to run the process again with a valid number and returns None. This gives users a clear way to choose from the retrieved papers, with visibility into what each result represents.
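A hypothetical usage sketch, mirroring how the main function wires this helper up to the metadata search, looks like this (the query string is just an example title, and the call is interactive):

# Hypothetical usage; the query string is an example title.
results = metadata_vector_store.similarity_search(query="attention is all you need", k=5)
selected_doc_id = select_document_from_results(results)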

3.0 | The Main Function

Show Code
def main():
    initial_query = input("Enter the title of the paper you wish to learn more about: ")
    search_results = metadata_vector_store.similarity_search(query=initial_query, k=5)
    selected_doc_id = select_document_from_results(search_results)
    if selected_doc_id:
        if not do_chunks_exist_already(selected_doc_id):
            process_and_upload_chunks(selected_doc_id)
        context, user_query = process_user_query(selected_doc_id)
        response = query_openai_with_context(context, user_query)
        print("Response from AI:", response)

The main function is where the flow of the application is defined, and where this prototype comes to life. It starts by prompting the user to enter the title of a paper they are interested in exploring further. This input is captured in the initial_query variable. Using this query, a similarity search is conducted within the metadata vector store to find the top 5 most relevant papers, with the results stored in search_results.

The function then proceeds with select_document_from_results to let the user choose one of the retrieved documents. If a document_id is selected successfully, the function checks if chunks for this document already exist in the system using the do_chunks_exist_already function. If they do not, process_and_upload_chunks is called to process and upload the document chunks to the database. Next, the process_user_query function captures the user’s specific query and gathers the relevant context from the chunks database. This context, along with the user query, is then used by query_openai_with_context to generate an AI-powered response, which is printed to the user. This sequence ensures a comprehensive user interaction where information retrieval and response generation are seamlessly integrated.
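Since the prototype is intended to be run inside a Jupyter Notebook, starting it is simply a matter of calling the entry point in a cell:

# Run the prototype from a notebook cell.
main()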

4.0 | Video Demonstration

Below is the video demonstrating the project prototype.

5.0 | Conclusion & Next Steps

Now that we have completed the working prototype, it is time to expand upon the functionalities we have here, make it actually look nice, and make it easier to use. All of this and more will be covered in the third and final part of this series. Until then, thank you for reading, and I look forward to seeing you again soon!