arXiv RAG Project Part 1 | Project Introduction and the Metadata Pipeline

An In-Depth Introduction to the arXiv RAG Project and the Metadata Pipeline
ai
machine learning
project
deep learning
nlp
rag
Author

Jack Tol

Published

May 4, 2024

Note

This blog post is available in audio format as well. You can listen using the player below or download the mp3 file for listening at your convenience.

Quick Tip!

This blog post assumes prior knowledge of language models and RAG. If you aren’t familiar with these topics, check out my previous blog post titled An Introduction to RAG in LLMs.

Important Note!

From Section 2 onwards, the code for the metadata downloading, preprocessing, and uploading is explained in detail. The code blocks can be viewed by opening up the “Show Code” folding subsection, which is present at the top of such sections.

1.0 | Project Introduction

1.1 | The Motivation

In recent years, we have seen an explosion in the amount of research and publications being uploaded by individuals and institutions all across the world. It makes sense that in a world that is richer, better educated, and better connected, the amount of research and development, especially in STEM and medical fields, is rapidly increasing.

Consider Figure 1:

Figure 1: Graph displaying the annual number of research papers uploaded to arXiv each year, from its inception to 2023.

Looking at Figure 1, it’s clear that the annual number of research papers uploaded to arXiv has been rising steeply since its inception through to 2023.

There are many scientific research publishing platforms out there, but arXiv has a unique and distinct attribute not present in many other platforms. Research uploaded to platforms such as PubMed or the Institute of Electrical and Electronics Engineers (IEEE) is subject to a long and laborious process of checking and verification before publication. This in itself is a very good thing, as the process sorts and filters proposed research, and work only gets published if the organization deems it unique, helpful, informative, reliable, and genuine.

There is, however, a big drawback to this approach: time. It can take a long time for research to be officially published and available to read, as the University of Auckland’s ResearchHub states:

‘After submission, the processing time for each of the steps above varies. The shortest turnaround time, especially with priority “hot topic” research, is about 9 weeks. However, it may take up to 20 (or more) weeks, depending on the amount of revision required and the processing speed of the editorial team. In some disciplines, the publication process may take a year or more from start to finish.’

This bottleneck of at least two months, and potentially a year or more, adds up over time and hinders quick, agile research and development of new ideas, concepts, and implementations.

One benefit of a preprint research paper platform such as arXiv is that it allows individuals and teams of researchers from institutions all around the world to upload and share their work within the industry. On arXiv, the average turnaround time can range from about 1 to 3 days, depending on the submission day. This enables more people to utilize these amazing discoveries and implementations in their own work.

There is, however, an important nuance to consider here. Generally, I like to think of research as falling into two main categories. The first can be thought of as ‘Practical research,’ that is, it has consequential implications and can potentially affect the lives of real people in negative ways. The second type can be thought of as ‘Theoretical research,’ that is, it is purely results-driven and doesn’t have any immediate personal consequences for anyone. I think, while we should always strive to ensure research is high quality, reliable, and bias-free, certain special considerations and regulations should be in place for the first type of research, which shouldn’t be necessary for the second type.

As an example, research published by medical publications, covering new and exciting developments in fields like medicine, psychology, and psychiatry, is extremely consequential. Clinicians who diagnose and treat patients are constantly informing themselves about the latest published research. For industries like these, it is crucial to ensure that no flawed papers slip through the cracks of the publication system, as this directly affects and influences the quality of patient care.

The same can be said for other fields such as engineering and aviation, where the lives of hundreds of millions of fliers and of the billions who use the world’s infrastructure are at stake. The same cannot be said, however, for more theoretical fields such as mathematics or computer science. Fields like these can allow for quick, agile research and development of new concepts, and they benefit immensely from this approach without compromise.

Consuming research from any platform, especially a preprint platform, exposes you to more research in a single day than you could ever hope to get through. That’s why my application is not designed to be a research paper finder, but rather a research paper learning supplement. Before using the platform, you should already know which paper you intend to learn about.

Research papers can often be quite challenging to get through. So much so that there are entire courses and lectures dedicated to the topic of how to effectively read and understand a research paper. The ideas, concepts, and methodologies can sometimes be difficult to grasp. This is why I believe students and enthusiasts would benefit from an application that is natively built to explain the ideas within the paper, making difficult-to-understand concepts and methodologies more accessible and easier to understand.

One of the advantages of using a language model is that it allows us to leverage the amazing explainability and flexibility that come hand-in-hand with language models. I’m sure many of you have experimented with a language model like ChatGPT, asking it to explain difficult concepts “like I’m 10 years old,” and were amazed by the model’s ability to capture the complex essence and nature of the topic being explained, while tailoring the delivery to be accessible and understandable to the user.

And the best part is, this doesn’t need to end with mere question-answering capabilities. By considering a more complex RAG pipeline setup and employing advanced prompt engineering techniques, we can potentially future-proof this application by exploring more complex functionalities. These could include allowing users to provide their pre-existing work and leveraging the user query, retrieved context, and the language model to seamlessly integrate new concepts and methodologies into their own work.

1.2 | Outlining the Plan

Before getting into the details, let’s take a look at the diagram outlining the overall plan and flow of information for this project.

Consider the following diagram:

flowchart TB
    A1["Ask User To Enter The Title For The Paper They Wish To Learn More About"] -- User Entered Paper Title --> A2["Embedding Model"]
    A2 -- Embedded Paper Title For Searching --> A3["Metadata Vector Store"]
    B1["Download Raw arXiv Metadata"] -- Raw Metadata --> B2["Metadata Processer"]
    B2 -- Cleaned Metadata --> B3["Embedding Model"]
    B3 -- Metadata Embeddings Ready For Storing in The Metadata Vector Store --> A3
    A3 -- Top 5 Papers Which The User is Most Likely Referring To --> C1["The Retrieved Options Displayed To The User"]
    C1 --> Cx1["User Selects Desired Paper From The Presented Options"]
    Cx1 -- The Document_ID For The Paper Selected By The User --> C2{"Do Chunks Exist
    For That Paper
    in the arXiv
    Chunks Vector Store?"}
    C2 -- Yes --> C3["Ask User To Enter Their Query"]
    C3 --> Cx2["User Enters Their Query"] -- User Query --> C4["Embedding Model"]
    C4 -- Embedded User Query For Searching After Filtering Using Document_ID --> C5["Chunks Vector Store"]
    C5 -- "Top-k Chunks Based on Similarity Search With Respect to The User Query" --> C6["Chunks Concatenated Into Context"]
    C6 -- Context --> C7["Context + User Query Added To Prompt Template"]
    Cx2 -- User Query --> C7
    C7 -- Prompt Template --> C8["Inference Model"]
    C8 --> C9["Output"]
    C2 -- No --> D1["Remotely Download Paper"]
    D1 --> D2["LangChain To Load Paper"]
    D2 -- Each Page's Content --> D3["LangChain To Semantically Chunk Document"]
    D3 -- Paper Chunks --> D4["Embedding Model"]
    D4 -- Embedded Chunks --> D5["Chunks Vector Store"]
    D5 --> D6{"Do Chunks Exist
    For That Paper
    in the arXiv
    Chunks Vector Store?"}
    D6 -- Yes --> C3
    D6 -- No --> E1["Paper Retrieval Failed. Please Try Another Paper."]
    E1 --> A1
Figure 2: Flowchart Illustrating the Steps Involved For The arXiv RAG Project

The first thing we notice is that there are two different starting points to consider when understanding the steps involved and the logistical flow of the project: the “Raw arXiv Metadata” entry point and the “Ask User to Enter the Title for the Paper They Wish to Learn More About” entry point.

Starting from the “Raw arXiv Metadata” entry point, the first step is to download the raw arXiv metadata using the arXiv OAI-PMH service, preprocess it, embed it, and then upload each vector to the metadata vector store.

Next, considering the “Ask User to Enter the Title for the Paper They Wish to Learn More About” entry point, we first take the title for the paper the user entered, pass it through the same embedding model, and then use that vector to search for the top 5 papers the user is most likely referring to.

After storing the metadata and searching based on the paper title entered by the user, the Pinecone Vector Store returns the top 5 papers which the user is most likely referring to. We then display these options to the user and have the user select one of the papers from the presented options.
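As a rough, self-contained sketch of that title lookup (not the project code; the index name is an assumption here, and the embedding model is the same default HuggingFace model used later in this post):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore

embedding_model = HuggingFaceEmbeddings()
metadata_store = PineconeVectorStore.from_existing_index(
    index_name="arxiv-metadata", embedding=embedding_model
)

user_title = "The Veldkamp Space of Two-Qubits"

# Retrieve the 5 metadata entries most similar to the user-entered title.
candidates = metadata_store.similarity_search(user_title, k=5)

for i, doc in enumerate(candidates, start=1):
    # page_content holds the "Title by Authors" string; document_id lives in the metadata.
    print(f"{i}. {doc.page_content} (document_id: {doc.metadata['document_id']})")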

Once the user selects one of the papers from the list, we check whether chunks already exist for that paper inside the chunks vector store. If they do, we go straight to asking the user to enter their query about the paper. The query is passed to an embedding model, and we then conduct a similarity search among the chunks for the selected paper, filtering on the paper’s document_id to ensure we only include content from the paper the user wishes to learn more about. The Chunks Vector Store returns the top-k chunks based on the similarity search with respect to the user query. These chunks are concatenated into context, the context along with the user query is added to a prompt template, the prompt template is passed to an inference model, and the inference model returns the output: in this case, the answer to the user’s query, using only the context returned from the similarity search among the document-specific chunks.
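A similarly rough sketch of that document-filtered retrieval step (the chunks index name, filter key, and k value are assumptions for illustration):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore

embedding_model = HuggingFaceEmbeddings()

# Assumed name for the chunks vector store index.
chunks_store = PineconeVectorStore.from_existing_index(
    index_name="arxiv-chunks", embedding=embedding_model
)

selected_document_id = "0704.0495"
user_query = "What is the Veldkamp space of two-qubits?"

# Restrict the similarity search to chunks belonging to the selected paper.
top_chunks = chunks_store.similarity_search(
    user_query, k=5, filter={"document_id": selected_document_id}
)

# Concatenate the retrieved chunks into the context for the prompt template.
context = "\n\n".join(chunk.page_content for chunk in top_chunks)
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {user_query}"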

However, if no chunks exist, then we remotely download the paper, use LangChain to load the paper, use LangChain to semantically chunk the paper, pass each of the chunks to an embedding model, and then upload the chunk embedding vectors to the Pinecone Chunks Vector Store. We then conduct a subsequent check to see whether chunks exist for the user-selected paper in the Pinecone Chunks Vector Store to ensure that the chunks were successfully uploaded. If chunks still don’t exist for the paper, then the paper retrieval process failed, and we ask the user to try a different paper. If they do exist, then we proceed to ask the user to enter their query, and the rest of the steps are the same as those outlined above.
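And a sketch of the fallback branch; PyPDFLoader and SemanticChunker are one possible LangChain loader/splitter pairing, and the PDF URL pattern and index name are assumptions for illustration:

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain_pinecone import PineconeVectorStore

document_id = "0704.0495"

# arXiv serves paper PDFs at a predictable URL built from the document_id (assumed pattern).
loader = PyPDFLoader(f"https://arxiv.org/pdf/{document_id}")
pages = loader.load()  # one Document per page

# Semantically chunk the page contents using the same embedding model.
embedding_model = HuggingFaceEmbeddings()
chunker = SemanticChunker(embedding_model)
chunks = chunker.split_documents(pages)

# Tag every chunk with the document_id so it can be filtered on later.
for chunk in chunks:
    chunk.metadata["document_id"] = document_id

chunks_store = PineconeVectorStore.from_existing_index(
    index_name="arxiv-chunks", embedding=embedding_model
)
chunks_store.add_documents(chunks)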

2.0 | Metadata Downloading, Preprocessing & Uploading

2.1 | Downloading the Metadata

Show Code
import logging
import time

from sickle import Sickle
from requests.exceptions import HTTPError, RequestException

def download_metadata(from_date, until_date):
    connection = Sickle('http://export.arxiv.org/oai2')
    logging.info('Getting papers...')
    params = {'metadataPrefix': 'arXiv', 'from': from_date, 'until': until_date, 'ignore_deleted': True}
    data = connection.ListRecords(**params)
    logging.info('Papers retrieved.')

    iters = 0
    errors = 0

    with open('arXiv_metadata_raw.xml', 'a+', encoding="utf-8") as f:
        while True:
            try:
                record = next(data).raw
                f.write(record)
                f.write('\n')
                errors = 0
                iters += 1
                if iters % 1000 == 0:
                    logging.info(f'{iters} Processing Attempts Made Successfully.')

            except HTTPError as e:
                handle_http_error(e)

            except RequestException as e:
                logging.error(f'RequestException: {e}')
                raise

            except StopIteration:
                logging.info(f'Metadata For The Specified Period, {from_date} - {until_date} Downloaded.')
                break

            except Exception as e:
                errors += 1
                logging.error(f'Unexpected error: {e}')
                if errors > 5:
                    logging.critical('Too many consecutive errors, stopping the harvester.')
                    raise

def handle_http_error(e):
    if e.response.status_code == 503:
        retry_after = e.response.headers.get('Retry-After', 30)
        logging.warning(f"HTTPError 503: Server busy. Retrying after {retry_after} seconds.")
        time.sleep(int(retry_after))
    else:
        logging.error(f'HTTPError: Status code {e.response.status_code}')
        raise e

Downloading the metadata is probably one of the most important aspects of this project, as this process is responsible for ensuring the system stays up-to-date with the latest research papers uploaded to arXiv. Downloading and storing every single research paper that has been, and will continue to be, uploaded would be financially and computationally expensive, with no real upside, given that the majority of people will be requesting and learning about the same minority of papers. It makes much more sense to store only each paper’s title and authors, along with its unique document_id. This approach allows a user to enter the title of a paper they wish to learn about, select the paper from the list retrieved from the metadata database, and have the system automatically download, load, process, and store the paper’s chunks as needed. It also means a paper only needs to be downloaded and processed once before it is permanently stored for retrieval, accommodating cases where different users want to learn about the same paper.

I found an out-of-date dataset containing arXiv research paper metadata on Kaggle. However, since the recency of the metadata is one of the most important aspects of this project, and because I knew I would need to perform daily subsequent metadata downloads, the first step was to review the metadata downloading options within the arXiv documentation.

According to the arXiv documentation, the preferred way to bulk-download and keep an up-to-date copy of arXiv metadata for all articles is to use the arXiv OAI-PMH service, which is updated daily with metadata from new articles. This was perfect, because I needed both an initial bulk download of all the metadata up to this point and a way to reliably perform subsequent daily downloads, ensuring I always have the most up-to-date metadata for the papers in my database.

With that established, to download the metadata I first created a function called download_metadata, which uses the Sickle library to connect to the arXiv OAI-PMH service and retrieve metadata for the research papers. It specifies a date range through the parameters from_date and until_date, which are defined in the main function, and constructs a query using these along with other parameters such as metadataPrefix, set to arXiv, and ignore_deleted, set to True, ensuring it does not retrieve deleted records.

Once the parameters are set up, connection.ListRecords(**params) is executed to fetch the records. Successful data retrieval is logged with the logging.info('Papers retrieved.') statement. Each record fetched is written to the file arXiv_metadata_raw.xml using a while True loop, ensuring continuous retrieval until all records are fetched or an exception interrupts the process.

Error handling is a significant aspect of the download_metadata function. The loop attempts to process each record individually and logs every 1000 successful attempts using the line logging.info(f'{iters} Processing Attempts Made Successfully.').

Several types of exceptions are caught: HTTPError, RequestException, StopIteration, and a generic Exception. Each has specific logging actions, and for certain conditions, the function may terminate execution, such as after five consecutive unexpected errors, indicated by logging.critical('Too many consecutive errors, stopping the harvester.') and raising an exception.

Further, HTTPError exceptions are managed by a dedicated function I created, handle_http_error, which handles retries when the server is busy. Managing this properly is very important, as we will be trying to download as many papers as possible before pausing for the period specified in arXiv’s error response, captured in the retry_after variable.

Note

The link to arXiv’s metadata downloading options can be found within the documentation on their website. See the references section for more information.

2.2 | Inspecting the Raw Metadata

Now that we have downloaded the metadata, let’s take a look at the raw data to get a better idea of how to begin cleaning and processing it.

Let’s consider a singular entry from the arXiv_metadata_raw.xml file:

<record xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><header><identifier>oai:arXiv.org:0704.0495</identifier><datestamp>2024-02-13</datestamp><setSpec>math</setSpec><setSpec>physics:math-ph</setSpec><setSpec>physics:quant-ph</setSpec></header><metadata><arXiv xmlns="http://arxiv.org/OAI/arXiv/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd"><id>0704.0495</id><created>2007-04-04</created><updated>2007-07-02</updated><authors><author><keyname>Saniga</keyname><forenames>Metod</forenames><affiliation>ASTRINSTSAV</affiliation></author><author><keyname>Planat</keyname><forenames>Michel</forenames><affiliation>FEMTO-ST</affiliation></author><author><keyname>Pracna</keyname><forenames>Petr</forenames><affiliation>JH-Inst</affiliation></author><author><keyname>Havlicek</keyname><forenames>Hans</forenames><affiliation>TUW</affiliation></author></authors><title>The Veldkamp Space of Two-Qubits</title><categories>quant-ph math-ph math.MP</categories><comments>7 pages, 3 figures, 2 tables; Version 2 - math nomenclature
  fine-tuned and two references added; Version 3 - published in SIGMA
  (Symmetry, Integrability and Geometry: Methods and Applications) at
  http://www.emis.de/journals/SIGMA/</comments><journal-ref>SIGMA 3 (2007) 075, 7 pages</journal-ref><doi>10.3842/SIGMA.2007.075</doi><abstract>  Given a remarkable representation of the generalized Pauli operators of
two-qubits in terms of the points of the generalized quadrangle of order two,
W(2), it is shown that specific subsets of these operators can also be
associated with the points and lines of the four-dimensional projective space
over the Galois field with two elements - the so-called Veldkamp space of W(2).
An intriguing novelty is the recognition of (uni- and tri-centric) triads and
specific pentads of the Pauli operators in addition to the "classical" subsets
answering to geometric hyperplanes of W(2).
</abstract></arXiv></metadata></record>

Looking at the data, we can see it is not in a standard XML format. This makes sense because, during the downloading process, we specified that we wanted the metadata in the custom arXiv metadata format. The reason for this is the sheer amount of information we get from this format. For example, if we had downloaded the metadata using the standard Simple Dublin Core XML format, the very important id tag, which holds the unique document_id for the article, wouldn’t exist at all. The closest we would get is the dc:identifier tag, which contains the link to the paper, <dc:identifier>http://arxiv.org/abs/0804.2273</dc:identifier>, with the document_id stored in the latter half of the link, requiring some messy regular-expression code to extract. The arXiv format, on the other hand, simply provides an id tag that stores this information, along with better author formatting and plenty of other information which isn’t important for this project but is still good to have.
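To illustrate the extra work the Dublin Core format would create, here is a hypothetical snippet showing the kind of regular-expression extraction the dc:identifier link would require:

import re

dc_identifier = "http://arxiv.org/abs/0804.2273"

# Pull the document_id out of the tail end of the abstract-page link.
match = re.search(r"arxiv\.org/abs/(.+)$", dc_identifier)
document_id = match.group(1) if match else None
print(document_id)  # 0804.2273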

Although having access to all of this information is very helpful, the custom XML format comes with the consequence that it can’t be parsed straightforwardly with the standard methods and libraries. Working with raw data like this can be quite challenging and uncomfortable, but as soon as we manage to parse this XML data into a Pandas DataFrame, it will be smooth sailing from there.

2.3 | Metadata Wrangling

2.3.1 | Initial Processing

Show Code
import xml.etree.ElementTree as ET

import pandas as pd

def parse_xml_to_df(xml_file):
    with open(xml_file, 'r', encoding='utf-8') as file:
        xml_content = file.read()
    
    if not xml_content.strip().startswith('<root>'):
        xml_content = f"<root>{xml_content}</root>"

    root = ET.ElementTree(ET.fromstring(xml_content)).getroot()
    records = []
    ns = {
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'arxiv': 'http://arxiv.org/OAI/arXiv/'
    }
    
    for record in root.findall('oai:record', ns):
        data = {}
        header = record.find('oai:header', ns)
        data['identifier'] = header.find('oai:identifier', ns).text
        data['datestamp'] = header.find('oai:datestamp', ns).text
        data['setSpec'] = [elem.text for elem in header.findall('oai:setSpec', ns)]
        
        metadata = record.find('oai:metadata/arxiv:arXiv', ns)
        data['id'] = metadata.find('arxiv:id', ns).text
        data['created'] = metadata.find('arxiv:created', ns).text
        data['updated'] = metadata.find('arxiv:updated', ns).text if metadata.find('arxiv:updated', ns) is not None else None
        data['authors'] = [
            (author.find('arxiv:keyname', ns).text if author.find('arxiv:keyname', ns) is not None else None,
             author.find('arxiv:forenames', ns).text if author.find('arxiv:forenames', ns) is not None else None)
            for author in metadata.findall('arxiv:authors/arxiv:author', ns)
        ]
        data['title'] = metadata.find('arxiv:title', ns).text
        data['categories'] = metadata.find('arxiv:categories', ns).text
        data['comments'] = metadata.find('arxiv:comments', ns).text if metadata.find('arxiv:comments', ns) is not None else None
        data['report_no'] = metadata.find('arxiv:report-no', ns).text if metadata.find('arxiv:report-no', ns) is not None else None
        data['journal_ref'] = metadata.find('arxiv:journal-ref', ns).text if metadata.find('arxiv:journal-ref', ns) is not None else None
        data['doi'] = metadata.find('arxiv:doi', ns).text if metadata.find('arxiv:doi', ns) is not None else None
        data['license'] = metadata.find('arxiv:license', ns).text if metadata.find('arxiv:license', ns) is not None else None
        data['abstract'] = metadata.find('arxiv:abstract', ns).text.strip() if metadata.find('arxiv:abstract', ns) is not None else None
        
        records.append(data)
    df = pd.DataFrame(records)
    return df

In order to convert our XML data into a Pandas DataFrame, the first thing we need to do is define a function, parse_xml_to_df, which is responsible for processing the XML file containing the raw metadata we just downloaded. The function begins by opening the XML file and reading its entire content into the variable xml_content.

To ensure the XML is properly formatted, it conditionally wraps xml_content in a root tag if it does not already begin with one, using if not xml_content.strip().startswith('<root>').

Using the xml.etree.ElementTree module, the function parses this content, obtaining the root of the XML structure. It then initializes a list named records to hold data extracted from each XML record.

Next, the function iterates over each XML element tagged as oai:record, using the namespaces defined in ns. For each record, it extracts all relevant metadata from the structured XML.

Each record’s data is stored in a dictionary, data, which includes metadata such as the identifier, datestamp, set specifications, authors, title, categories, and other attributes relevant to the research paper. Optional fields fall back to None when the corresponding tag is missing from a record.

After processing all records, these dictionaries are compiled into a pandas DataFrame, df = pd.DataFrame(records), which represents the structured metadata extracted from the XML file containing the raw downloaded metadata.

This DataFrame is then returned, providing a structured representation of the original XML data suitable for further cleaning and processing.
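A quick sanity check after parsing might look like this; the column list simply reflects the fields extracted above:

df = parse_xml_to_df('arXiv_metadata_raw.xml')

print(df.shape)  # (number of records, 15)
print(df.columns.tolist())
# ['identifier', 'datestamp', 'setSpec', 'id', 'created', 'updated', 'authors',
#  'title', 'categories', 'comments', 'report_no', 'journal_ref', 'doi',
#  'license', 'abstract']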

2.3.2 | Data Cleaning and Organizing

Show Code
def preprocess_dataframe(df):
    df = df[['datestamp', 'id', 'created', 'authors', 'title']].copy()
    
    df.rename(columns={
        'datestamp': 'last_edited',
        'id': 'document_id',
        'created': 'date_created'
    }, inplace=True)
    
    df.loc[:, 'title'] = df['title'].astype(str)
    df.loc[:, 'authors'] = df['authors'].astype(str)
    
    df.loc[:, 'title'] = df['title'].str.replace('  ', ' ', regex=True)
    df.loc[:, 'authors'] = df['authors'].str.replace('  ', ' ', regex=True)
    
    df.loc[:, 'title'] = df['title'].str.replace('\n', '', regex=True)
    
    df.loc[:, 'authors'] = df['authors'].str.replace(r'[\[\]\'"()]', '', regex=True)

    def flip_names(authors):
        author_list = authors.split(', ')
        flipped_authors = []
        for i in range(0, len(author_list), 2):
            if i+1 < len(author_list):
                flipped_authors.append(f"{author_list[i+1]} {author_list[i]}")
        return ', '.join(flipped_authors)

    df.loc[:, 'authors'] = df['authors'].apply(flip_names)
    
    df.loc[:, 'last_edited'] = pd.to_datetime(df['last_edited'])
    df.loc[:, 'date_created'] = pd.to_datetime(df['date_created'])
    
    df = df[df['document_id'].str.match(r'^\d')]

    df = df[df['last_edited'] == df['date_created'] + pd.Timedelta(days=1)]
    
    df.loc[:, 'title_by_authors'] = df['title'] + ' by ' + df['authors']
    
    df.drop(['title', 'authors', 'date_created', 'last_edited'], axis=1, inplace=True)
    
    df.to_csv('metadata_processed.csv', index=False)
    return df

The next step is to take our Pandas DataFrame and clean up and organize the data to better suit the needs of this project. Initially, we drop all fields except for the five we need for this step of the RAG pipeline: datestamp, id, created, authors, and title. Next, we rename some of the fields for clarity and to better align with what they represent, changing datestamp and created to last_edited and date_created respectively. We then explicitly convert the title and authors fields to the string datatype to prevent any further parsing errors. Next, we replace any double spaces with single spaces in both the title and authors fields, remove any line breaks from records in the title field, and clean up the authors field by removing any square brackets, parentheses, single quotes, and double quotes.

The next thing we need to do is correct the naming format within the authors field. By default, authors are formatted as “Last Name, First Name” and listed in a comma-separated string. This format can be problematic when we need to display names in a more conversational style, such as “First Name Last Name.” The flip_names function addresses this by splitting the input string on commas, iterating over the resulting items in steps of two, and swapping each consecutive (last name, first name) pair. If a trailing item is left without a pair, it is skipped. Finally, the swapped names are rejoined into a single string, which ensures the names are presented in the desired format.
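As a concrete example, if we lift the flip_names helper out of preprocess_dataframe, the cleaned authors string from the record we inspected earlier is transformed like this (parse_xml_to_df only extracts keyname and forenames, so affiliations don’t appear):

raw_authors = "Saniga, Metod, Planat, Michel, Pracna, Petr, Havlicek, Hans"

print(flip_names(raw_authors))
# Metod Saniga, Michel Planat, Petr Pracna, Hans Havlicek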

After we have applied the flip_names function to the authors field, we explicitly convert the last_edited and date_created columns to the datetime datatype for easy comparison. We then check that each record in the document_id field starts with a digit, dropping any rows where this isn’t the case.

When we specify a date range during the metadata downloading step, the papers captured include both newly uploaded papers and older papers that have been updated, revised, and republished. This usually isn’t an issue; however, we intend for this step of the RAG pipeline to function as an arXiv research paper search engine via a similarity search. If we were to store every paper returned by the download, the system would quickly become cluttered with duplicates as more and more papers are revised and republished, so we need a way to filter for newly uploaded papers only. Rarely, if ever, is the title of a research paper changed, and the document_id is never changed. This is why we retained both the datestamp and created fields, renamed to last_edited and date_created to better reflect what they represent: last_edited is the date a record was (re)uploaded, whether it is a brand-new paper or merely a revision of an old one, while date_created is the date the paper was originally created and submitted. In principle, brand-new papers are those where last_edited and date_created are the same. In practice, due to some server-side discrepancies and arXiv’s published-date conventions, new papers are those where last_edited is exactly one day after date_created, which is the condition we filter on.
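To make this filter concrete, here is a tiny toy example (the rows are made up purely for illustration) showing which records survive:

import pandas as pd

toy = pd.DataFrame({
    'document_id': ['2404.00001', '2404.00002', '0704.0495'],
    'date_created': pd.to_datetime(['2024-04-01', '2024-03-15', '2007-04-04']),
    'last_edited': pd.to_datetime(['2024-04-02', '2024-04-02', '2024-04-02']),
})

# Keep only records uploaded the day after they were created, i.e. brand-new papers.
new_papers = toy[toy['last_edited'] == toy['date_created'] + pd.Timedelta(days=1)]
print(new_papers['document_id'].tolist())  # ['2404.00001']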

We then concatenate the title and authors fields into a new field named title_by_authors, formatting it as "Title by Authors" to provide a clear, combined descriptor of each document. Lastly, we drop the original title, authors, date_created, and last_edited fields, leaving us with a cleaned DataFrame containing only the document_id and the title_by_authors fields for each new arXiv research paper, ready for embedding and uploading to our vector database.

2.3.3 | Inspecting The Cleaned Metadata

You may have noticed that at the end of the preprocess_dataframe function, the DataFrame is also exported to a CSV file. This is for manual visual inspection of the data, allowing us to confirm that the metadata processing went as expected.

Let’s now take a look at a singular entry from the metadata_processed.csv file:

document_id,title_by_authors
2402.18851,"Applications of 0-1 Neural Networks in Prescription and Prediction by Vrishabh Patil, Kara Hoppe, Yonatan Mintz"

Wow! That looks way cleaner and is now in a format ready for uploading to our vector database.

2.4 | Uploading the Metadata

Show Code
def upload_to_pinecone(df, vector_store):
    texts = df['title_by_authors'].tolist()
    metadatas = df[['document_id']].to_dict(orient='records')
    vector_store.add_texts(texts=texts, metadatas=metadatas)

The vector database used for this project is Pinecone. A combination of the extensive Pinecone portal, fantastic support, native integration with LangChain, and the serverless option where you only pay for the queries and vectors you store and use made this decision a no-brainer.

The first thing we need to do is create the upload_to_pinecone function, which takes in the cleaned-up DataFrame returned from the preprocess_dataframe function and the vector_store variable, which we define in the main function. Within the function, we create a texts variable and set it to a list of all the records within the title_by_authors field. We then create a metadatas variable holding a dictionary of metadata for each row. In our case, we only pass each record’s document_id as metadata, but this structure allows us to add more metadata in the future if desired, such as date_created. Then, using the langchain_pinecone integration library, we simply call the .add_texts method on our vector_store, passing our texts and metadatas variables as arguments, which kicks off the embedding and uploading of each of the metadata entries we have downloaded and preprocessed.
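For the cleaned record we inspected in Section 2.3.3, the inputs handed to .add_texts would look roughly like this:

texts = [
    "Applications of 0-1 Neural Networks in Prescription and Prediction by Vrishabh Patil, Kara Hoppe, Yonatan Mintz",
]
metadatas = [
    {"document_id": "2402.18851"},
]

# Each text is embedded by the vector store's embedding model and stored alongside its metadata.
vector_store.add_texts(texts=texts, metadatas=metadatas)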

2.5 | Logging & The Main Function

Show Code
import os

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore

def setup_logging():
    logging.basicConfig(filename='arxiv_download.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def main():
    setup_logging()
    from_date = '2024-04-01'
    until_date = '2024-04-30'
    download_metadata(from_date, until_date)
    xml_file = 'arXiv_metadata_raw.xml'
    df = parse_xml_to_df(xml_file)
    os.remove(xml_file)
    df = preprocess_dataframe(df)
    if not df.empty:
        # langchain_pinecone reads PINECONE_API_KEY from the environment.
        PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
        embedding_model = HuggingFaceEmbeddings()
        index_name = "arxiv-metadata"
        vector_store = PineconeVectorStore.from_existing_index(index_name=index_name, embedding=embedding_model)
        upload_to_pinecone(df, vector_store)
    else:
        logging.error("DataFrame is empty. Skipping upload.")

if __name__ == '__main__':
    main()

The main function is where this whole metadata downloading, preprocessing, and uploading pipeline comes to life. It determines the flow of the entire pipeline, calling all the functions we have made, passing the correct arguments to those functions, and defining the variables they rely on. The main function runs when the script is executed.

The first thing we do is run the setup_logging() function, which writes logging information to its own separate .log file. This records any tracebacks or errors received, along with the status of critical processes and branching actions within the application, giving us insight into what went wrong, what went right, and why. You’ll see logging.info() calls dispersed throughout the code at these points. Next, we define our from_date and until_date, the variables used to define the date range when downloading the metadata. We then run the download_metadata function, passing our from_date and until_date as arguments.

Next, we define the xml_file variable and set it to the arXiv_metadata_raw.xml file which was downloaded, pass xml_file to the parse_xml_to_df function, assign the returned value to the df variable, perform some local cleanup by removing the xml_file, and then pass df to the preprocess_dataframe function. Lastly, we check that the DataFrame is not empty. If it isn’t, we read our PINECONE_API_KEY from the environment variables, set the embedding_model to the HuggingFaceEmbeddings() object provided by LangChain, set the index_name variable to the name of the index we will be uploading our vectors to (in this case arxiv-metadata), and create the vector_store variable by initializing our PineconeVectorStore with the from_existing_index method, passing the index_name and embedding_model as arguments. Finally, we run our upload_to_pinecone function, passing both the finalized DataFrame and the vector store as arguments.
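For the subsequent daily downloads mentioned earlier, the hard-coded from_date and until_date can be swapped for values computed at runtime. A minimal sketch, assuming each daily run harvests the previous day’s records:

from datetime import date, timedelta

# Harvest the previous day's metadata on each daily run.
yesterday = date.today() - timedelta(days=1)
from_date = yesterday.isoformat()
until_date = yesterday.isoformat()

download_metadata(from_date, until_date)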

3.0 | Conclusion & Next Steps

Now that we have uploaded all of the downloaded and cleaned metadata, it’s time to move on to part two of this project, where we begin to create all the functions and code for a real, working prototype of the project.

Part two will be uploaded in a few weeks’ time. Until then, thank you for taking the time to read through this post. I look forward to seeing you again soon!

References

arXiv. 2024. “Bulk Data Access - arXiv.org.” https://info.arxiv.org/help/bulk_data.html.
Tol, Jack. 2024. “An Introduction to Retrieval Augmented Generation (RAG) in Large Language Models.” https://blog.jacktol.net/posts/an_introduction_to_rag_in_llms/.
University of Auckland Research Hub. 2024. “How Do I Publish in a Journal?” https://research-hub.auckland.ac.nz/the-publishing-process/how-do-i-publish-in-a-journal.