Note

This blog post is available in audio format as well. You can listen using the player below or download the mp3 file for listening at your convenience.

Quick Tip!

If you’re already familiar with the basics of language models, feel free to skip ahead straight to Section 2: Introducing Retrieval Augmented Generation (RAG).

1.0 | Preliminary Knowledge: Language Models

1.1 | What is a Language Model?

A language model is a computational model designed to understand, generate, and parse human language.

Language models serve as the backbone for various natural language processing (NLP) tasks, such as:

Text Generation
Text Translation
Text Summarization
Sentiment Analysis

These probabilistic models are trained on extensive collections of text data, referred to in the industry as a corpora of text data, in order to learn the statistical patterns and structures of language and predict the likelihood of any given word following a preceding sequence of words.

The language model manages to capture the syntactic, semantic, and contextual intricacies of language, allowing them to generate coherent and contextually relevant text.

Simplistic Example

For instance, suppose we have a language model trained on a corpus of movie scripts.

Given the prompt “The hero looked into the distance and saw”, the language model will calculate the probability, \(P\), of the next token being one of the following options from its vocabulary.

“a dragon”
“an army of enemies”
“a beautiful sunset”
“nothing but darkness”

The language model assigns a probability to each option based on the patterns it has learned from the training data.

Next Token Probabilities

\[ \begin{aligned} &P(\text{next token} = \text{"an army of enemies"}| \space \text{prompt} \ \& \ \text{the training data}) = 0.40 \\ &P(\text{next token} = \text{"nothing but darkness"}| \space \text{prompt} \ \& \ \text{the training data}) = 0.25 \\ &P(\text{next token} = \text{"a dragon"}| \space \text{prompt} \ \& \ \text{the training data}) = 0.20 \\ &P(\text{next token} = \text{"a beautiful sunset"}| \space \text{prompt} \ \& \ \text{the training data}) = 0.15 \\ \end{aligned} \]

In this example, the model, based on the prompt and the training data, will use “an army of enemies” as the token most likely to follow the prompt “The hero looked into the distance and saw” and will return to the user the concatenated output: “The hero looked into the distance and saw an army of enemies.

Note

In the example above, the term “token” refers to either a whole word or a part of a word. However, I used a phrase to represent the token for illustrative purposes.

Jargon Alert!

The collection of all unique words used in a language model’s training data is known as the model’s vocabulary.

1.2 | Transfer Learning in Language Models

Transfer learning plays a significant role in the development of language models, allowing them to leverage knowledge from pre-trained models to improve performance on similar, yet different tasks than what the model was originally designed for.

In the context of language models, transfer learning typically involves pre-training on large corpora of text data followed by fine-tuning on task/domain specific data.

Key Concepts in Transfer Learning

Pre-training: Involves training a language model on diverse corpora of text data to learn general language representations.
Fine-tuning: Involves adapting the pre-trained language model to a specific task or domain by further training it on task-specific datasets.

1.3 | The Attention Mechanism in Transformer Models

The attention mechanism in Transformer-based language models serves as a fundamental technique for establishing connections between tokens within the same sentence.

By considering the positional placement of each token, this mechanism evaluates two critical aspects: the word embedding, which encapsulates its semantic meaning, and the positional encoding, which distinguishes relationships between nearby and distant words.

Through this process, the attention mechanism enhances the word embeddings to encompass not only the word’s context but also the contextual information of neighboring words. This mechanism empowers Transformer models to effectively capture intricate dependencies and contextual nuances across the input sequence.

Understanding Attention

The attention mechanism consists of queries, keys, and values, allowing the model to weigh the importance of different input tokens when generating representations.
The Transformer Architecture typically employs multi-head attention, where multiple attention heads are used to capture diverse aspects of the input sequence simultaneously. Each head processes parallel computations of self-attention, enabling a comprehensive analysis of various relationships within the data.
To account for the sequential nature of the input, positional encodings are added to the input embeddings to provide information about a tokens position.

1.4 | Prompt Engineering

Prompt engineering is a powerful technique that can be used to leverage language models to perform tasks they were not specifically trained for.

Prompt engineering can be thought of as a form of local, session-based fine-tuning. It involves providing explicit instructions to the model on the task it needs to perform, followed by examples demonstrating how to execute the task.

Fine-tuning, as discussed previously, typically involves subsequent training of a pre-trained model with task or domain-specific information to improve its performance. In this case, by crafting prompts, you’re essentially tailoring the input to guide the model’s responses toward desired outcomes or styles for a particular session or interaction.

The two primary applications of prompt engineering include task adaptation and question answering. In the following two sub-sections, we briefly cover each one.

1.4.1 | Task Adaptation

Task adaptation involves providing the language model with explicit instructions and examples to execute novel tasks, even if they weren’t part of its initial training data. By carefully designing prompts and offering relevant illustrations, the model can effectively adapt and perform tasks it hasn’t encountered before.

Task Adaptation Example

System Prompt

“Your job is to transform text to replace names with specific placeholders. For example, if the text mentions ‘John’ or ‘James’, replace them with ‘Person’.”

User Prompt

“John went to the store. Jane likes to read books.”

Model Output

“Person went to the store. Person likes to read books.”

1.4.2 | Question Answering

A slight extension of task adaptation leads us to question answering. This essentially allows the model to extract information from provided contexts and generate accurate responses to posed questions.

This is usually done by providing a somewhat templated input to the model, first by providing some system-level instructions to the model, then providing some relevant context, which is then followed by a question.

By creating structured prompts with instructions, important contextual details, and a question, the model can respond in an accurate and novel way, even to a question it has never seen before.

Question Answering Example

Instruction

“Provide information about the capital city of France.”

Context

“France is a country located in Western Europe known for its rich history and culture. It has been a significant center of art, fashion, and gastronomy.”

Question

“What is the capital city of France?”

Model Output

“The capital city of France is Paris, which is renowned for its iconic landmarks such as the Eiffel Tower and Louvre Museum, contributing to its status as a significant center of art, fashion, and gastronomy.”

2.0 | Introducing Retrieval Augmented Generation (RAG)

2.1 | So What Is RAG?

The best way to understand RAG and why it is such a powerful technique is to understand one of the big limitations of language models. At a high level, the steps involved when working with any machine learning model broadly include:

Collect The Data
Train the model
Use The Trained Model For Inference

A model is considered ‘trained’ after it has been found to perform reasonably well, with respect to the pre-defined training metric on data it has never seen before. As you may have noticed, due to the implementation of the training pipeline, it is required to collect and organize your data before beginning the training process. This, however, comes with an implicit limitation on the currency of the data on which the model has been trained.

For many applications though, this does not pose an issue due to the broad and general nature of the problem domains which are being worked on. For example, suppose we had an image classification model fine-tuned to classify tumors as being either malignant or benign, we can be fairly certain that the nature of malignant or benign tumors won’t really change from year to year. However, when users are interacting with a language model, by asking questions, fact-checking information, and learning new things it is more important than ever to make sure that the language model is up to date with the latest concepts, ideas, perspectives and information.

RAG is simply a technique that enhances language models by incorporating retrieved accurate and up-to-date information from an authoritative external knowledge base. This is combined with prompt engineering techniques, which enables the generation of high-quality, current, and accurate responses, even on questions the model has never encountered before, while also reducing the occurrence of model-generated errors, also known as ‘hallucinations’.

Jargon Alert!

In language models, a hallucination occurs when the model generates content that appears credible but is actually false or unrelated to the input. This typically results from training inconsistencies or data limitations.

2.2 | How does RAG Work?

Before getting into the details, I first want to introduce a high-level sequential approach to the pipeline in order to garner a bit more intuition for the steps and processes to follow.

Let’s consider the following diagram:

flowchart TB
    A1(["User Query"]) -->|"User query"| B4["Embedding Model"]
    B4 -->|"Embedded user query for searching"| B6["Vector Database"]
    A1 -->|"User query"| B8["Context + User Query Added To Prompt Template"]
    A2(["External Documents"]) -->|"Documents (PDFs, TXTs, CSVs, etc.)"| B1["Document Loader"]
    B1 -->|"Content of each page"| B2["Document Splitter"]
    B2 -->|"Sentences after heuristic-based splitting"| B3["Document Chunker"]
    B3 -->|"Document chunks"| B5["Embedding Model"]
    B5 -->|"Chunks of combined sentences ready for storing"| B6
    B6 -->|"Top-k chunks based on similarity search with respect to the user query"| B7["Chunks Concatenated Into Context"]
    B7 -->|"Context"| B8
    B8 -->|"Prompt Template"| B9["Inference Model"]
    B9 --> B10(["Output"])

Figure 1: High-Level Overview of the RAG Pipeline

The first thing we notice is that there are two different starting points we can consider when understanding the RAG Pipeline: the User Query perspective and the External Documents perspective.

Starting with the External Documents perspective, we see the first step after locating/collecting our external documents is to preprocess the text data. This generally involves passing the documents through a library such as spaCy or LangChain to load the documents, perform heuristic or semantic-based splitting and chunking, and then passing our document chunks through to an embedding model, which transforms each chunk into a vector of embeddings ready for storage in our vector database.

Note

The number of dimensions of the vectors containing the document chunks’ embeddings will vary depending on the embedding model you use. For example, the Sentence Transformers model will embed your chunks to 768 dimensions, while OpenAI’s text-embedding-3-large model will embed your chunks to 3072 dimensions.

Let’s now examine the pipeline from the User Query perspective. When a user inputs a query of any sort, the prompt is immediately passed through to an embedding model. It is crucial that this be the same embedding model used for the document chunks to ensure that the query vector and the chunk vectors are compatible. This compatibility is essential for conducting a similarity search on them in future steps.

2.3 | Text Embeddings

When a language model processes natural human language, the initial step involves passing the text through a tokenizer. A tokenizer is a specialized model trained to break down a sentence into smaller components, typically including words or parts of words.

After successfully tokenizing the input text, the next step is to process these tokens using an embedding model. An embedding model is designed to convert each token into a numerical vector, intelligently capturing the semantic, syntactic, and contextual nuances of the token.

Building on the concept of token embeddings, sentence embeddings assume that the semantic, syntactic, and contextual attributes of a sentence can be represented as the aggregate of its individual tokens. To achieve this, sentences are input into a sentence embedding model, utilizing a technique known as mean pooling. This process aggregates the embedding vectors of each token and outputs a single composite embedding that reflects the overall semantic, syntactic, and contextual intricacies of the sentence.

It is possible to apply the same principles of sentence embeddings to a chunk of text, which combines two or three sentences together into a mini-paragraph of sorts. This is the main idea behind how we are going to store our external documents in a vector database.

This is why the first step of our RAG pipeline is to split our documents into individual sentences using a splitting heuristic, such as defining a sentence as ending once we encounter a period followed by a space. It’s important to notice here that different types of text data will be written with different syntax than others, so the heuristic we choose for text-splitting will change from use case to use case.

Fortunately, libraries such as LangChain offer a wide variety of built-in text-splitters, such as splitting on HTML-specific characters, Markdown-specific characters, code (with up to 15 different languages to choose from), user-defined characters, and more.

Note

A complete list of LangChain’s text-splitting options can be found within the LangChain documentation (“Text Splitters” 2024). See the references section of this blog for more information.

When processing documents composed primarily of English syntax, LangChain offers a combined splitting and chunking tool, taken from Greg Kamradt, called the Semantic Chunker. This tool first splits the text into sentences, then passes each sentence through an embedding model, compares the embedding vectors of adjacent sentences, and combines the most similar ones. Instead of going through an inaccurate and laborious process of encoding and decoding the sentences, each sentence embedding retains a link to its original sentence through indexing, which ensures accurate identification and combination of the originally inputted sentences.

In the past, chunking was done more robotically and mechanically by simply going through the document from top to bottom and combining three or four sentences together into chunks. While this method is faster, simpler, and less computationally demanding, some of the semantic and contextual details will inevitably be lost in the chunking process.

Just as Greg Kamradt stated in his notebook, “5 Levels of Text-Splitting,” (Kamradt 2022),

“Isn’t it weird that we have a global constant for chunk size? Isn’t it even weirder that our normal chunking mechanisms don’t take into account the actual content? I’m not the only one who thinks so, there has to be a better way - let’s explore and find out.”

2.4 | Similarity Search in RAG

2.4.1 | Similarity Search as a Means of Retrieval

One of the most important and consequential elements of the RAG pipeline is the similarity search algorithm, which serves as the heuristic for retrieving relevant context. Let’s take some time to explore not only how this works but also why it is a critical component of the pipeline.

2.4.2 | Using Distance Between Vectors as a Similarity Metric

In the context of RAG and information retrieval, using the distance between vectors as a similarity metric is a fundamental concept. This approach leverages the idea that the smaller the distance between two vectors, the greater their similarity. Commonly, metrics such as Euclidean Distance or Cosine Similarity are utilized to quantify this relationship.

By measuring the distance or proximity in a multi-dimensional vector space, we can effectively gauge the degree of similarity between the contents they represent. This enables the accurate and efficient retrieval of the top-k most similar document chunks with respect to the user’s query, which will then serve as the context provided to the language model.

In this exploration, we will assume the use of Cosine Similarity as the similarity metric within our RAG pipeline.

2.4.3 | The Role of Vector Normalization in Cosine Similarity

When comparing vector-based similarity in the retrieval process, one might be tempted to simply use the dot product between vectors. However, the mere dot product can be misleading as it is influenced by the magnitude of the vectors, not just the angle between them. This is where vector normalization becomes crucial to ensure the range of similarity scores remains between -1 and 1, indicating the cosine of the angle between them.

Consider the following Code Example 1:

import torch

def cosine_similarity(vector1, vector2):
    return torch.nn.functional.cosine_similarity(vector1.unsqueeze(0),
    vector2.unsqueeze(0)).round()[0].item()

# Example Vectors
vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector3 = torch.tensor([-1, -2, -3], dtype=torch.float32)

# Calculate Dot Products & Cosine Similarities
dot_product_v1_v2 = torch.dot(vector1, vector2)
cosine_sim_v1_v2 = cosine_similarity(vector1, vector2)
dot_product_v1_v3 = torch.dot(vector1, vector3)
cosine_sim_v1_v3 = cosine_similarity(vector1, vector3)

Figure 2: Code Example 1: Demonstrating the Differences Between Dot Product and Cosine Similarity as Similarity Metrics

Output for Code Example 1

Description	Value
Dot Product Between Vector1 and Vector2	14.0
Cosine Similarity Between Vector1 & Vector2	1.0
Dot Product Between Vector1 & Vector3	-14.0
Cosine Similarity Between Vector1 & Vector3	-1.0

Inspecting the output, we can see that while the dot product gives us the raw scalar product of the vectors, the cosine similarity function provided by torch.nn.functional further normalizes these vectors. This normalization accounts for the lengths of the vectors and adjusts the similarity measure to focus solely on the directionality and orientation of the vectors relative to each other.

This distinction is vital, especially in high-dimensional spaces where the magnitude of vectors can vastly differ. If raw dot products were used, it could potentially skew the similarity measures. Therefore, normalized cosine similarity provides a more accurate and meaningful metric for evaluating the similarity between a user’s query and stored document chunks in the context of our RAG pipeline.

Fortunately, text embedding models from companies like OpenAI, Cohere, and SentenceBERT typically perform this normalizing step for us automatically as part of the embedding process. This means that the vectors of embeddings are already in the format required for conducting similarity searches.

Note

Code Example 1 is adapted from Daniel Bourke’s notebook titled “Create and Run a Local RAG Pipeline from Scratch”(Bourke 2024).

2.4.4 | Calculating The Distance Between Vectors

Before diving into the specifics of the formula, it’s important to understand the fundamental concept of measuring the similarity between vectors in a space.

At a high level, the cosine similarity evaluates the cosine of the angle between two vectors, providing a metric that assesses how closely the directions of the two vectors align, regardless of their magnitude.

Consider the following equation:

\[ \text{cosine similarity} = \mathcal{S}_c(\mathbf{A}, \mathbf{B}):= \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\| \mathbf{A} \| \| \mathbf{B} \|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} \]

The equation demonstrates how the normalization of vectors is integrated into the similarity measurement process.

By dividing the dot product of vectors \(\mathbf{A}\) and \(\mathbf{B}\) by the product of their magnitudes, we ensure that the similarity score reflects only the orientation of these vectors in space, irrespective of their lengths. This formula confirms that cosine similarity is a normalized metric, ideal for comparing the directional similarity between vectors in high-dimensional spaces. This functionality is precisely what we require when conducting a similarity search among vectors of embedded document chunks with respect to an embedded user query.

2.4.5 | Retrieving Document Chunks and Assembling Context

With this in mind, we can start to bring these pieces together to create the retrieval part of the pipeline. When a user inputs their query into the language model, we take their query and pass it through the same embedding model we used to embed our document chunks. Then, we perform a similarity search, using cosine similarity as the metric, against all the document chunk vectors with respect to the user query vector. We then simply return the top-k chunks with the highest similarity score. Once we have retrieved our chunks, we are ready to concatenate them into the context which we will be passing through to a prompt template, seamlessly continuing to the next part of the pipeline.

2.5 | Preparing Our Prompt Template

The importance of properly writing our prompt template cannot be overstated. It is crucial for the model to understand that we want it to respond to the user’s query using only the provided context. One of the major benefits and selling points of RAG is that we can trust the accuracy and reliability of the content the model generates because the knowledge base and external documents the model uses are controlled by us. Our aim is to combine a general, pre-trained base language model with the authoritative nature of our external documentation. This approach allows us to leverage the impressive reasoning, explaining, structuring, and formatting capabilities of a base language model alongside our reliable and accurate retrieved information.

Many different people have come up with their own prompt templates to suit their individual use cases, however a common template can look something like this:

Instructions: Your job is to answer the user’s query using only the provided context. Keep your answer grounded in the facts of the provided context. If the context does not contain the facts needed to answer the user’s query, return: “I do not have enough information available to accurately answer the question.”

Context: {context}

Query: {user_query}

2.6 | Choosing an Inference Model

The inference model we choose for our RAG application will significantly influence the quality of the responses. There are several factors to consider when selecting a language model for the pipeline. For instance, the context window size of the model determines how many functional chunks of retrieved information we can send to the model to answer the user’s question. This also affects the conversational aspect of the model. Ideally, a user wouldn’t just ask one question about one document, but rather would engage in a natural back-and-forth with the model. The model would remember previously retrieved chunks, past user queries, and previous answers to user queries to enable a rich and comprehensive discussion about the external documents. With this in mind, large language models from companies such as OpenAI, Anthropic, Google, and Meta are preferable. Although they come with a cost per token input and output, the benefits of high-quality responses and the improved short-term working memory of the models far outweigh the costs.

2.7 | And we’re done!

After passing our prompt template to the inference model, the response we receive will be a comprehensive, accurate, and reliable answer to the user’s query, based solely on the context retrieved from earlier steps in the pipeline.

The next blog post will be part one of a three-part series, exploring a unique personal project I’ve been working on. This project involves connecting the live stream of STEM research papers uploaded daily to arXiv to a RAG pipeline. This setup will aid students and enthusiasts in understanding the complex concepts, methods, and ideas that have been and continue to be discovered.

For links to all resources, notebooks, and documentation used in the creation of this blog, please see the References section.

References

Bourke, Daniel. 2024. “Simple Local Retriever-Augmented Generator.” https://github.com/mrdbourke/simple-local-rag/blob/main/00-simple-local-rag.ipynb.

Jamil, Umar. 2023. “Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW).” YouTube video. https://www.youtube.com/watch?v=rhZgXNdhWDY.

Kamradt, Greg. 2022. “5 Levels of Text Splitting.” https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/8a30b5710b3dd99ef2239fb60c7b54bc38d3613d/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb.

“Text Splitters.” 2024. LangChain. 2024. https://python.langchain.com/docs/modules/data_connection/document_transformers/.