arXiv RAG Project Part 3 | Building and Deploying the Application

From Prototype to Deployment: Covering Frameworks, Async Support, Hosting Platforms, and More
ai
machine learning
project
deployment
deep learning
nlp
rag
Author

Jack Tol

Published

June 17, 2024

Note

This blog post is available in audio format as well. You can listen using the player below or download the mp3 file for listening at your convenience.

Quick Tip!

This blog post is a continuation of the arXiv RAG project I have been working on and assumes prior knowledge of Language Models and RAG. If you haven’t already, please review Part 1 and Part 2 of this series, in addition to the introductory post on Language Models and RAG.

Important Note!

Each function of this project is explained in detail. From Section 3 onwards, the code blocks are hidden by default but can be viewed by expanding the “Show Code” folding subsection at the top of each section. The full code can also be found on my GitHub, linked here.

1.0 | Introduction

In the previous blog post, we covered the development of a prototype for this project, which we managed to get working in a notebook environment, though we knew there was still a lot of work to be done before it was finished. This post will cover the steps I took to transform the project from its barebones prototype state into a full-fledged web application.

The transition from a notebook environment to a web application involves several critical steps, including selecting the right framework, ensuring scalability, and enhancing user interaction. We will delve into the rationale behind choosing Chainlit as the framework, given its robust features for building conversational AI interfaces. Additionally, we’ll discuss the implementation of asynchronous functionality and session management to support multiple users effectively. By the end of this post, you will have a comprehensive understanding of how to develop and deploy a sophisticated web application capable of providing detailed explanations of research papers using advanced AI techniques.

You can visit and use the deployed application by heading to arxivgpt.net.

2.0 | A Review of Parts 1 & 2

In Part 1 of this series, we laid the groundwork for the arXiv RAG project by introducing its motivation and outlining the plan. We delved into the reasons behind the exponential growth of research publications and highlighted the unique attributes of the arXiv platform that facilitate rapid dissemination of new research. We also discussed the importance of creating a research paper learning supplement rather than a simple paper finder, leveraging language models to make complex concepts and methodologies more accessible to students and enthusiasts. We outlined the metadata pipeline, including downloading, preprocessing, and uploading metadata to ensure the system stays up-to-date with the latest research papers.

Part 2 continued by bringing together the components to construct a working prototype of the application. We reviewed the essential libraries and global variables used in the project, detailing the imports, helper functions, and the main function. We demonstrated the automatic process of downloading, loading, chunking, embedding, and uploading research papers using a combination of the arXiv library and LangChain. The helper functions ensured seamless integration of various tasks such as checking for existing chunks, processing user queries, and querying OpenAI for context-based responses. The main function tied everything together, allowing users to interact with the system by entering paper titles, selecting documents, and receiving AI-generated responses based on their queries.

3.0 | Selecting the Right Framework

3.1 | Evaluating Initial Frameworks

The first thing we needed to do was select a Python framework for creating an interactive web application. There are various frameworks of this kind available, each with its own distinct implementation and characteristics.

Initially, I experimented with Gradio and Streamlit. Gradio’s interface did not meet the design and functionality requirements I had in mind, and it also lacked several features crucial to my project. Streamlit, on the other hand, offered a more appealing appearance, but it has inherent limitations in its design: it re-runs the entire script from top to bottom each time a user interacts with it, and it lacks native asynchronous support. This was a significant drawback for a project that requires handling multiple user sessions simultaneously and efficiently, so I needed a framework that could meet these requirements more effectively.
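To illustrate that re-run behavior, here is a hypothetical Streamlit snippet (not part of this project): every time the button is clicked, Streamlit executes the whole script again, which is why even a simple counter has to live in st.session_state rather than an ordinary variable.

import streamlit as st

st.title("Counter")

# The script re-runs from the top on every interaction, so state must be kept
# in st.session_state rather than in ordinary local variables.
if "count" not in st.session_state:
    st.session_state.count = 0

if st.button("Increment"):
    st.session_state.count += 1

st.write(f"Current count: {st.session_state.count}")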

3.2 | Discovering Chainlit

After coming across Chainlit, I immediately knew it was exactly what I needed for this project. Chainlit is a free and open-source Python framework that allows for the creation of asynchronous, high-quality, and multimodal Conversational AI in a beautiful and customizable ChatGPT-esque user interface. It offers features such as text generation streaming, the Ask User API for user input, multi-modality for audio, spontaneous file uploads, image processing, persistent chat history, and user authentication. Additionally, Chainlit integrates with tools like FastAPI, OpenAI, and LangChain, enabling features like visualized chain-of-thought. After struggling to build a robust Conversational AI application with Streamlit or Gradio, I realized Chainlit was the ideal framework for this arXiv RAG project and any future conversational AI applications.
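To give a sense of why this model suits a multi-user chat application, here is a minimal, self-contained sketch of Chainlit’s decorator-based, asynchronous event model. It is illustrative only and not the project code; the handler names and messages are made up.

import chainlit as cl

@cl.on_chat_start
async def start():
    # Runs once per user session; anything stored in cl.user_session is
    # isolated to that session.
    cl.user_session.set("history", [])
    await cl.Message(content="Hello! Ask me anything.").send()

@cl.on_message
async def handle(message: cl.Message):
    # Runs for each incoming message without re-executing the whole script.
    history = cl.user_session.get("history")
    history.append(message.content)
    await cl.Message(content=f"You've sent {len(history)} message(s) this session.").send()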

4.0 | Building the Application

4.1 | Initializing our Embedding Model, Vector Stores & Text Splitter

Show Code
# Imports assumed to be defined once near the top of app.py (not shown in the
# original snippet): standard logging plus the LangChain integrations used below.
import logging
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

logger = logging.getLogger(__name__)

# Function to initialize embedding model
def initialize_embeddings():
    """Initialize the OpenAI embedding model."""
    logger.info("Initializing OpenAI embeddings...")
    return OpenAIEmbeddings(model="text-embedding-3-small")

# Function to initialize vector stores
def initialize_vector_stores(embedding_model):
    """Initialize Pinecone vector stores for metadata and chunks."""
    logger.info("Initializing Pinecone vector stores...")
    metadata_vector_store = PineconeVectorStore.from_existing_index(embedding=embedding_model, index_name="arxiv-rag-metadata")
    chunks_vector_store = PineconeVectorStore.from_existing_index(embedding=embedding_model, index_name="arxiv-rag-chunks")
    return metadata_vector_store, chunks_vector_store

# Function to initialize text splitter
def initialize_text_splitter():
    """Initialize the recursive character text splitter."""
    logger.info("Initializing text splitter...")
    return RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=50,
        length_function=len,
        is_separator_regex=False
    )

When building the different functionalities for this web application, which we will ultimately deploy, we need to build with care and attention to its asynchronous, session-based nature. We don’t want this to work just once on one person’s device; we want it to serve many sessions from multiple people at the same time, without one user’s actions affecting another user’s session. For example, one user may request a paper that isn’t yet in the chunks vector store and must therefore be downloaded, processed, and uploaded, all while another user requests a paper that is already in the chunks vector store, requiring API calls to the embedding model, the Pinecone vector store, and the inference model. These operations need to be able to run simultaneously for different people in different sessions. In other words, we need to design the application so that asynchronous function calls and session-based state are isolated through and through.

The first function, initialize_embeddings, logs an initialization message and returns a LangChain OpenAIEmbeddings model set to the 1536-dimension text-embedding-3-small embedding model. Next, we define the initialize_vector_stores function, which takes an embedding_model parameter; this is set to the embedding model returned by initialize_embeddings, which we call in the main function. This function also logs a message and uses LangChain’s Pinecone integration to set two variables, metadata_vector_store and chunks_vector_store, to their respective indexes, passing through the embedding_model. We then define initialize_text_splitter, which also logs a message and returns a LangChain RecursiveCharacterTextSplitter object. For this object, we specify a chunk_size of 1000 characters, a chunk_overlap of 50 characters, set the length_function to len, and set is_separator_regex to False.
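As a quick illustration of what those splitter settings mean in practice, here is a standalone sketch, assuming the langchain_text_splitters package; the sample text is made up and this is not part of the application code.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False
)

# Roughly 3,000 characters of sample text.
sample_text = "Retrieval-augmented generation grounds model answers in retrieved text. " * 40

chunks = splitter.split_text(sample_text)
print(len(chunks))                   # a handful of chunks
print(max(len(c) for c in chunks))   # each chunk is at most 1000 characters; adjacent chunks overlap slightly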

4.2 | Send Actions Function

Show Code
async def send_actions():
    """Send action options to the user."""
    actions = [
        cl.Action(name="ask_followup_question", value="followup_question", description="Uses The Previously Retrieved Context", label="Ask a Follow-Up Question"),
        cl.Action(name="ask_new_question", value="new_question", description="Retrieves New Context", label="Ask a New Question About the Same Paper"),
        cl.Action(name="ask_about_new_paper", value="new_paper", description="Ask About A Different Paper", label="Ask About a Different Paper")
    ]
    await cl.Message(content="### Please Select One of the Following Options:", actions=actions).send()

There are various points throughout this application where we want to display different options to the user, allowing them to direct the flow of the application. This could be after the LLM’s response has finished generating or after the user presses the ‘stop generation’ button. The send_actions function sets an actions variable to a list of Chainlit actions, passing through as arguments their corresponding name, value, description, and label. It also sends a message to the screen, formatted in Markdown, asking the user to ‘Please Select One of the Following Options’, in addition to displaying the previously defined list of actions. By setting up the function this way, whenever we want to send this message and the action buttons to the screen, we can simply call the send_actions function, which will then execute the requisite functionality.
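As a side note, a Chainlit cl.Action only does something once a handler is registered for its name. The project’s actual handlers appear in Section 4.12; the wiring pattern, shown here with a hypothetical handler, looks like this:

import chainlit as cl

# Hypothetical handler: the decorator's argument must match the Action's `name`
# (here, the "ask_followup_question" action defined in send_actions).
@cl.action_callback("ask_followup_question")
async def on_followup_clicked(action: cl.Action):
    await cl.Message(content=f"You selected: {action.label}").send()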

4.3 | On Stop Function

Show Code
@cl.on_stop  # assumed: registers this coroutine as Chainlit's stop-generation handler
async def on_stop():
    """Handle session stop event to clean up tasks."""
    streaming_task = user_session.get('streaming_task')
    if streaming_task:
        streaming_task.cancel()
        await send_actions()
    user_session.set('current_document_id', None)
    user_session.set('streaming_task', None)
    logger.info("Session stopped and cleaned up.")

The on_stop function is an asynchronous function designed to handle the session stop event triggered by the “stop generation” button, which can be clicked while the LLM is generating the answer to the user query. When this button is pressed, the function retrieves the streaming_task from the user_session. If a streaming_task exists, it cancels the task and calls the send_actions function to display the appropriate options to the user. It then sets the current_document_id and streaming_task in the user_session to None, effectively clearing these values. Finally, it logs an informational message indicating that the session has been stopped and cleaned up.
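Outside of Chainlit, the cancellation behavior relied on here boils down to standard asyncio task handling. The following self-contained sketch (illustrative only) shows the same pattern: create a task, cancel it, and await it while absorbing the CancelledError.

import asyncio

async def fake_stream():
    """Stands in for the long-running token-streaming task."""
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        print("Streaming task cancelled cleanly.")
        raise

async def demo():
    task = asyncio.create_task(fake_stream())
    await asyncio.sleep(0.1)
    task.cancel()  # same idea as streaming_task.cancel() in on_stop
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(demo())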

4.4 | Main Function

Show Code
@cl.on_chat_start  # assumed: registers this coroutine to run at the start of each chat session
async def main():
    """Main function to start the chat session."""
    embedding_model = initialize_embeddings()
    metadata_vector_store, chunks_vector_store = initialize_vector_stores(embedding_model)
    text_splitter = initialize_text_splitter()

    user_session.set('embedding_model', embedding_model)
    user_session.set('metadata_vector_store', metadata_vector_store)
    user_session.set('chunks_vector_store', chunks_vector_store)
    user_session.set('text_splitter', text_splitter)
    user_session.set('current_document_id', None)

    text_content = """## Welcome to the arXiv Research Paper Learning Supplement

This system is connected to the live stream of papers being uploaded to arXiv daily.

### Instructions

1. **Enter the Title**: Start by entering the title of the research paper you wish to learn more about.
2. **Select a Paper**: Choose a paper from the list of retrieved papers.
3. **Database Check**: The system will check if the research paper exists in the research paper content database.
   - If it exists, you will be prompted to enter your question.
   - If it does not exist, the program will download the paper to the database and then ask you to enter your question.
4. **Read the Answer**: After reading the answer, you will have the following options:
   - Ask a follow-up question.
   - Ask a new question about the same paper.
   - Ask a new question about a different paper.

### Get Started
When You're Ready, Follow the First Step Below.
"""
    await cl.Message(content=text_content).send()

    await ask_initial_query()

The main function starts the chat session. It initializes the embedding model by calling initialize_embeddings(), sets up the metadata and chunks vector stores by calling initialize_vector_stores with the embedding model as an argument, and initializes the text splitter using initialize_text_splitter().

Once these components are initialized, the function sets various session variables in user_session, including embedding_model, metadata_vector_store, chunks_vector_store, text_splitter, and current_document_id (set to None).

It then defines a text content variable containing a welcome message and instructions for the user. This message explains how to interact with the system, which is connected to the live stream of papers being uploaded to arXiv daily. The instructions guide the user through entering a paper title, selecting a paper, checking the database, and asking questions.

Finally, the function sends this welcome message to the user using cl.Message(content=text_content).send() and calls ask_initial_query() to prompt the user to start interacting with the system.

4.5 | Ask Initial Query Function

Show Code
async def ask_initial_query():
    """Prompt the user to enter the title of the research paper."""
    res = await cl.AskUserMessage(content="### Please Enter the Title of the Research Paper You Wish to Learn More About:", timeout=3600).send()
    if res:
        initial_query = res['output']
        metadata_vector_store = user_session.get('metadata_vector_store')
        logger.info(f"Searching for metadata with query: {initial_query}")
        search_results = metadata_vector_store.similarity_search(query=initial_query, k=5)
        logger.info(f"Metadata search results: {search_results}")
        selected_doc_id = await select_document_from_results(search_results)
        if selected_doc_id:
            user_session.set('current_document_id', selected_doc_id)
            chunks_exist = await do_chunks_exist_already(selected_doc_id)
            if not chunks_exist:
                await process_and_upload_chunks(selected_doc_id)
            else:
                await ask_user_question(selected_doc_id)

The ask_initial_query function is an asynchronous function that prompts the user to enter the title of a research paper they wish to learn more about. It sends a message to the user asking for the title and waits for their response with a timeout of 3600 seconds. If a response is received, it extracts the initial query from the user’s input.

The function then retrieves the metadata_vector_store from the user_session and logs the search query. It performs a similarity search on the metadata vector store using the initial query, retrieving the top 5 results. These search results are logged for reference.

Next, the function prompts the user to select a document from the search results by calling select_document_from_results. If the user selects a document, the function sets the current_document_id in the user_session to the selected document ID.

The function then checks if the document chunks already exist by calling do_chunks_exist_already with the selected document ID. If the chunks do not exist, it processes and uploads the chunks by calling process_and_upload_chunks. If the chunks already exist, it prompts the user to ask a question about the document by calling ask_user_question.

4.6 | Ask User Question Function

Show Code
async def ask_user_question(document_id):
    """Prompt the user to enter a question about the selected document."""
    logger.info(f"Asking user question for document_id: {document_id}")
    context, user_query = await process_user_query(document_id)
    if context and user_query:
        task = asyncio.create_task(query_openai_with_context(context, user_query))
        user_session.set('streaming_task', task)
        await task

The ask_user_question function is an asynchronous function that prompts the user to enter a question about the selected document. It starts by logging the action with the document ID.

The function then calls process_user_query with the document ID, which processes the user’s query and retrieves the context and user query. If both context and user query are successfully retrieved, it creates an asynchronous task to query OpenAI with the provided context and user query by calling query_openai_with_context.

This task is then stored in the user_session under the key streaming_task. Finally, the function awaits the completion of this task, ensuring that the query is processed and the response is handled accordingly.

4.7 | Select Document From Results Function

Show Code
async def select_document_from_results(search_results):
    if not search_results:
        await cl.Message(content="No Search Results Found").send()
        return None

    message_content = "### Please Enter the Number Corresponding to Your Desired Paper:\n"
    message_content += "| No. | Paper Title | Doc. ID |\n"
    message_content += "|-----|-------------|---------|\n"

    for i, doc in enumerate(search_results, start=1):
        page_content = doc.page_content
        document_id = doc.metadata['document_id']
        message_content += f"| {i} | {page_content} | {document_id} |\n"

    await cl.Message(content=message_content).send()

    while True:
        res = await cl.AskUserMessage(content="", timeout=3600).send()
        if res:
            try:
                user_choice = int(res['output']) - 1
                if 0 <= user_choice < len(search_results):
                    selected_doc_id = search_results[user_choice].metadata['document_id']
                    selected_paper_title = search_results[user_choice].page_content
                    await cl.Message(content=f"\n**You selected:** {selected_paper_title}").send()
                    return selected_doc_id
                else:
                    await cl.Message(content="\nInvalid Selection. Please enter a valid number from the list.").send()
            except ValueError:
                await cl.Message(content="\nInvalid input. Please enter a number.").send()
        else:
            await cl.Message(content="\nNo selection made. Please enter a valid number from the list.").send()

The select_document_from_results function is an asynchronous function that prompts the user to select a document from a list of search results. If no search results are found, it sends a message to the user indicating that no results were found and returns None.

If search results are available, the function constructs a message content string that lists the available papers with corresponding numbers, titles, and document IDs. This message is formatted as a Markdown table and sent to the user.

The function enters a loop, awaiting the user’s input with a timeout of 3600 seconds. When a response is received, it attempts to convert the input to an integer corresponding to the selected paper’s number. If the input is valid and within the range of available search results, the function retrieves the selected document ID and paper title. It then sends a confirmation message to the user, indicating their selection, and returns the selected document ID.

If the input is invalid, such as being out of range or not a number, it sends an error message and prompts the user to enter a valid selection again. This process repeats until a valid selection is made.

4.8 | Do Chunks Exist Already Function

Show Code
async def do_chunks_exist_already(document_id):
    """Check if chunks for the document already exist."""
    chunks_vector_store = user_session.get('chunks_vector_store')
    filter = {"document_id": {"$eq": document_id}}
    test_query = chunks_vector_store.similarity_search(query="Chunks Existence Check", k=1, filter=filter)
    logger.info(f"Chunks existence check result for document_id {document_id}: {test_query}")
    return bool(test_query)

The do_chunks_exist_already function is an asynchronous function that checks if chunks for a specified document already exist. It retrieves the chunks_vector_store from the user_session.

The function constructs a filter to match the document_id and performs a similarity search on the chunks_vector_store with a query of “Chunks Existence Check” and a k value of 1, using the constructed filter.

The result of this search is logged, indicating whether chunks exist for the specified document ID. The function returns True if the search query returns any results, indicating that chunks already exist, and False otherwise.

4.9 | Download PDF & Process and Upload Chunks Functions

Show Code
async def download_pdf(session, document_id, url, filename):
    """Download the PDF file asynchronously."""
    logger.info(f"Downloading PDF for document_id: {document_id} from URL: {url}")
    async with session.get(url) as response:
        if response.status == 200:
            async with aiofiles.open(filename, mode='wb') as f:
                await f.write(await response.read())
            logger.info(f"Successfully downloaded PDF for document_id: {document_id}")
        else:
            logger.error(f"Failed to download PDF for document_id: {document_id}, status code: {response.status}")
            raise Exception(f"Failed to download PDF: {response.status}")

async def process_and_upload_chunks(document_id):
    """Download, process, and upload chunks of the document."""
    await cl.Message(content="#### It seems that paper isn't currently in our database. Don't worry, we are currently downloading, processing, and uploading it. This will only take a few moments.").send()
    await asyncio.sleep(2)

    try:
        # Create an async session for downloading
        async with ClientSession() as session:
            paper = await asyncio.to_thread(next, arxiv.Client().results(arxiv.Search(id_list=[str(document_id)])))
            url = paper.pdf_url
            filename = f"{document_id}.pdf"
            await download_pdf(session, document_id, url, filename)

        # Load and split the PDF into pages
        loader = PyPDFLoader(filename)
        pages = await asyncio.to_thread(loader.load)

        # Process and split pages into chunks
        text_splitter = user_session.get('text_splitter')
        content = []
        found_references = False

        for page in pages:
            if found_references:
                break
            page_text = page.page_content
            lower_text = page_text.lower()
            if "references" in lower_text:
                # Split case-insensitively so "References"/"REFERENCES" headings are both caught
                content.append(page_text[:lower_text.index("references")])
                found_references = True
            else:
                content.append(page_text)

        full_content = ''.join(content)
        chunks = text_splitter.split_text(full_content)

        # Ensure embedding model is initialized
        embedding_model = user_session.get('embedding_model')
        if not embedding_model:
            raise ValueError("Embedding model not initialized")

        # Upload chunks to Pinecone asynchronously
        chunks_vector_store = user_session.get('chunks_vector_store')
        await asyncio.to_thread(
            chunks_vector_store.from_texts,
            texts=chunks,
            embedding=embedding_model,
            metadatas=[{"document_id": document_id} for _ in chunks],
            index_name="arxiv-rag-chunks"
        )

        # Clean up the downloaded PDF file asynchronously
        await aiofiles.os.remove(filename)
        logger.info(f"Successfully processed and uploaded chunks for document_id: {document_id}")

        # Ensure the transition to asking a question happens
        await ask_user_question(document_id)

    except Exception as e:
        logger.error(f"Error processing and uploading chunks for document_id {document_id}: {e}")
        await cl.Message(content="#### An error occurred during processing. Please try again.").send()
        return

The download_pdf function is an asynchronous function that downloads a PDF file. It logs the action of downloading the PDF for the specified document_id from the provided url. Using an asynchronous session, it makes a GET request to the URL. If the response status is 200, it writes the content to a file asynchronously. Upon successful download, it logs a success message. If the download fails, it logs an error message and raises an exception.

The process_and_upload_chunks function is an asynchronous function that downloads, processes, and uploads chunks of a document. It starts by sending a message to the user, indicating that the paper is being processed, and waits for 2 seconds.

Within a try block, it creates an asynchronous session to download the paper’s PDF using the download_pdf function. It loads and splits the PDF into pages using PyPDFLoader. The function processes the pages, stopping when it encounters references, and concatenates the content.

The function retrieves the text_splitter from the user_session and splits the concatenated content into chunks. It ensures the embedding model is initialized and retrieves it from the user_session.

The chunks are uploaded to Pinecone asynchronously, with each chunk associated with the document_id. After uploading, the function asynchronously deletes the downloaded PDF file and logs a success message.

If any exception occurs during the process, it logs an error message, sends a message to the user indicating that an error occurred during processing, and returns. On success, it instead calls ask_user_question to transition the user to the next step.

4.10 | Process User Query Function

Show Code
async def process_user_query(document_id):
    """Process the user's query about the document."""
    res = await cl.AskUserMessage(content="### Please Enter Your Question:", timeout=3600).send()
    if res:
        user_query = res['output']
        context = []
        chunks_vector_store = user_session.get('chunks_vector_store')
        filter = {"document_id": {"$eq": document_id}}
        attempts = 5  # Number of attempts to check for context
        for attempt in range(attempts):
            search_results = chunks_vector_store.similarity_search(query=user_query, k=15, filter=filter)
            context = [doc.page_content for doc in search_results]
            if context:
                break
            logger.info(f"No context found, retrying... (attempt {attempt + 1}/{attempts})")
            await asyncio.sleep(2)  # Wait before retrying

        logger.info(f"User query processed. Context: {context}, User Query: {user_query}")
        return context, user_query
    return None, None

The process_user_query function is an asynchronous function that processes the user’s query about the document. It prompts the user to enter their question with a message that has a timeout of 3600 seconds. If a response is received, it extracts the user’s query from the response and initializes an empty context list.

The function retrieves the chunks_vector_store from the user_session and sets a filter to match the document_id. It then attempts to find relevant context for the user’s query by performing a similarity search on the chunks_vector_store. The search is repeated up to 5 times, each time retrieving up to 15 results that match the query. If no context is found, it logs a message and waits for 2 seconds before retrying; this retry loop mainly covers the case where the chunks have only just been uploaded and may not yet be queryable in the index.

Once context is found or all attempts are exhausted, the function logs the processed user query and context. It returns the context and the user’s query. If no response is received from the user, it returns None for both context and user query.

4.11 | Query OpenAI With Context Function

Show Code
async def query_openai_with_context(context, user_query):
    """Query OpenAI with the context and user query."""
    if not context:
        await cl.Message(content="No context available to answer the question.").send()
        return

    client = AsyncOpenAI()

    settings = {
        "model": "gpt-4o",
        "temperature": 0.3,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }

    message_history = [
        {"role": "system", "content": """
         Your job is to answer the user's query using only the provided context.
         Be detailed and long-winded. Format your responses in markdown formatting, making good use of headings,
         subheadings, ordered and unordered lists, and regular text formatting such as bolding of text and italics.
         Sometimes the equations retrieved from the context will be formatted improperly in an incompatible format
         for correct markdown rendering. Therefore, if you ever need to provide equations, make sure they are
         formatted properly using LaTeX, wrapping the equation in single dollar signs ($) for inline equations
         or double dollar signs ($$) for bigger, more visual equations. Keep your answer grounded in the facts
         of the provided context. If the context does not contain the facts needed to answer the user's query, return:
         "I do not have enough information available to accurately answer the question."
         """},
        {"role": "user", "content": f"Context: {context}"},
        {"role": "user", "content": f"Question: {user_query}"}
    ]

    msg = cl.Message(content="")
    await msg.send()

    async def stream_response():
        stream = await client.chat.completions.create(messages=message_history, stream=True, **settings)
        async for part in stream:
            if token := part.choices[0].delta.content:
                await msg.stream_token(token)

    streaming_task = asyncio.create_task(stream_response())
    user_session.set('streaming_task', streaming_task)

    try:
        await streaming_task
    except asyncio.CancelledError:
        streaming_task.cancel()
        return

    await msg.update()

    await send_actions()

The query_openai_with_context function is an asynchronous function that queries OpenAI with the provided context and user query. If no context is available, it sends a message to the user indicating that there is no context to answer the question and returns.

The function initializes an AsyncOpenAI client and sets the query settings, including the model, temperature, top_p, frequency penalty, and presence penalty. It constructs the message history with a system message instructing the AI on how to respond, including formatting guidelines and ensuring the use of LaTeX for equations. It also includes the context and user query in the message history.

A message object msg is created and sent to the user. The function defines an asynchronous stream_response function that streams the response from the OpenAI chat completion, sending each token to the user as it is generated.

The stream_response function is executed as a task, and this task is stored in the user_session under the key streaming_task. The function waits for the streaming task to complete, handling cancellation if necessary. Once the task is completed, it updates the message and calls send_actions to display the appropriate options to the user.

4.12 | Action Callback Functions

Show Code
# Action callbacks
# Note: in Chainlit, a callback is wired to an action by name via the
# @cl.action_callback decorator; the names below match those defined in send_actions.
@cl.action_callback("ask_followup_question")
async def handle_followup_question(action):
    """Handle follow-up question action."""
    current_document_id = user_session.get('current_document_id')
    if current_document_id:
        context, user_query = await process_user_query(current_document_id)
        if context and user_query:
            task = asyncio.create_task(query_openai_with_context(context, user_query))
            user_session.set('streaming_task', task)
            await task

@cl.action_callback("ask_new_question")
async def handle_new_question(action):
    """Handle new question action."""
    current_document_id = user_session.get('current_document_id')
    if current_document_id:
        await ask_user_question(current_document_id)

@cl.action_callback("ask_about_new_paper")
async def handle_new_paper(action):
    """Handle new paper action."""
    await ask_initial_query()

The handle_followup_question function handles the follow-up question action. It retrieves the current document ID from the user_session and processes the user’s query about the document using the process_user_query function. If context and user query are obtained, it creates an asynchronous task to query OpenAI with the provided context and user query. This task is set in the user_session under streaming_task and awaited for completion.

The handle_new_question function handles the action to ask a new question about the current document. It retrieves the current document ID from the user_session and calls the ask_user_question function, which prompts the user to enter a new question.

The handle_new_paper function handles the action to ask about a new paper. It calls the ask_initial_query function, which prompts the user to enter the title of the research paper they wish to learn more about.

These callback functions are connected, by name, to the actions defined in the send_actions function, which displays the available options to the user at different points in the application flow. When the user selects an action, the corresponding callback function is invoked to handle that specific request, ensuring a smooth and interactive user experience.

5.0 | Application Deployment

5.1 | Deploying With Fly.io

Now that we have completed creating the application, it’s time to think about how and where to deploy it. Reviewing the Chainlit Deployment Documentation, we see that they provide guides for deploying a Chainlit application to various platforms, including Fly.io, HuggingFace Spaces, AWS, and more.

When selecting a platform for deployment, I wanted to strike a balance between simplicity and customizability. I also preferred having the application hosted on its own with a direct URL to the web app, rather than as a space on HuggingFace Spaces. While AWS and Google Cloud are reliable options, they involve additional setup and complexity that I wanted to avoid for this project. This led me to choose Fly.io, which I found struck the right balance between simplicity and customizability.

Following the cookbook guide on the Chainlit GitHub, the first step was to create an account with Fly.io. I then installed the Fly CLI, which is used both for the initial deployment of the app and for deploying updates, and ran flyctl auth login to log into my account from the command line. From there, I navigated to the directory where my Chainlit application was stored, created a Procfile containing web: python -m chainlit run app.py -h --port 8080, and created a requirements.txt file. The latter was made easy by the pipreqs library; running the pipreqs . command from within the project directory generates a requirements.txt file based on the project’s imports.

The next steps were straightforward. I created my Fly project using the flyctl launch command, answered “No” to all prompts during setup, deployed the app with flyctl deploy, and set the number of machines dedicated to the project to one using the flyctl scale count 1 command. This streamlined process ensured that my application was up and running efficiently with minimal hassle.
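For reference, here are the Procfile line and the commands described above collected in one place; these are exactly the steps mentioned in this section, with short comments added.

# Procfile (in the project root)
web: python -m chainlit run app.py -h --port 8080

# Run from the project directory
pipreqs .             # generate requirements.txt from the project's imports
flyctl auth login     # log in to Fly.io from the command line
flyctl launch         # create the Fly project (answer "No" to the setup prompts)
flyctl deploy         # deploy the application
flyctl scale count 1  # run the app on a single machine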

6.0 | Video Demonstration

Below, you will find a video demonstrating the finalized and deployed version of the arXiv RAG project. This video provides an overview of the application’s features, functionality, hosting provider, and how to navigate and use it effectively.

7.0 | Project Conclusion

There we have it: a fully deployed Conversational AI application using the Chainlit framework, OpenAI embedding and inference models, LangChain, and Pinecone, connected to a custom-built metadata pipeline. This tool is designed to help students, enthusiasts, and researchers learn about, consume, and understand the latest research being published on arXiv. This is the first open-source project I have undertaken, and I have learned so much about programming, interacting with APIs, error management, asynchronous programming practices, prompt engineering, DevOps/MLOps, and working with the latest NLP-focused libraries like LangChain and Chainlit, among many other things. Thank you for joining me through the motivation and conception of the idea in Part 1, the creation of the first notebook prototype in Part 2, and ultimately its transformation into a deployed web app here in Part 3.

I look forward to doing more work and sharing it with all of you. In the coming weeks, I will be focusing on writing about some of the ideas and concepts I have recently learned about and find the most captivating in the AI/ML sphere. Once again, thank you and take care.

References

“An Introduction to RAG in LLMs.” 2024. Jack Tol. 2024. https://jacktol.net/posts/an_introduction_to_rag_in_llms/.
“arXiv RAG Project Part 1.” 2024. Jack Tol. 2024. https://jacktol.net/posts/arxiv_rag_project_part_1/.
“arXiv RAG Project Part 2.” 2024. Jack Tol. 2024. https://jacktol.net/posts/arxiv_rag_project_part_2/.
“arXiv Research Assistant RAG GitHub Repository.” 2024. Jack Tol. 2024. https://github.com/jack-tol/arxiv-research-assistant-rag.
“Chainlit Documentation: Get Started Overview.” 2024. Chainlit. 2024. https://docs.chainlit.io/get-started/overview.
“Chainlit GitHub Repository.” 2024. Chainlit. 2024. https://github.com/Chainlit/chainlit.
“Fly.io Documentation.” 2024. Fly.io. 2024. https://fly.io/docs/.
“Pipreqs: Generate Pip Requirements.txt File Based on Imports in Your Project.” 2024. PyPI. 2024. https://pypi.org/project/pipreqs/.