At Arcus, data is at the core of what we do. We automatically discover the right data for each AI application connected to our data enrichment platform. Our Prompt Enrichment offering allows users to discover and integrate relevant, valuable ground-truth data and context from both internal and external data sources to augment the prompts of their generative AI applications. This is powered in part by the information retrieval systems that underlie Retrieval Augmented Generation (RAG). At Arcus, we operate at one of the largest scales of RAG in the world, discovering relevant context from our massive corpus of external data sources as well as customers’ internal data stores.
Through this process, we’ve naturally hit unique scaling limitations of Retrieval Augmented Generation (RAG), pushing us to innovate beyond traditional approaches. This article is a deep dive into how we architect text-based RAG implementations for Large Language Models (LLMs) to effectively navigate our planet-scale data retrieval scenarios.
LLMs are designed to understand and generate human-like natural language based on the patterns they have learned from large amounts of text. However, LLMs don't always have the necessary information and context to generate accurate, relevant or coherent responses. This is why Retrieval-Augmented Generation (RAG) techniques have been developed.
In the example above, GPT-3.5 responds correctly and coherently when RAG is used to retrieve and augment the initial prompt with additional context. Without RAG, the model relies on information encoded in its parameters during pre-training and fine-tuning. This is a lossy way to store knowledge: it is prone to holding stale information and to hallucinating (i.e. "guessing") when the model lacks the information it needs to form an accurate response.
By using RAG, the model can incorporate current contextual information, enabling it to craft accurate and factual responses. This is the major difference from using LLMs without RAG: the technique keeps the model up to date with ground-truth context and lets it generate responses by interpreting the context provided in the prompt.
Retrieval Augmented Generation (RAG) is a technique that combines concepts from Information Retrieval with LLMs. RAG is often implemented with embedding search, which uses an embedding model to convert text into vectors (a process called text embedding); these vectors are then typically stored in a vector database. In this vector space, semantically similar content is represented by vectors that lie close together. For a given query, you can find the most relevant information by looking up the vectors closest to your query vector, which ensures the retrieved information is closest to your query in semantic meaning.
However, embedding models typically have a fixed context window, a limit on the amount of text they can process at a time, so we can only generate embeddings for snippets of text, not entire documents. To use RAG over a set of documents, we have to split (“chunk”) each document into smaller pieces before converting (“embedding”) them into vectors, so that each embedding corresponds to a specific chunk of information from the original document. When we later want to retrieve information, we search for the K closest embeddings to our query to retrieve the K most relevant text chunks across all of our documents. The retrieved chunks are then used to augment our original prompt and fed to the LLM, which processes both the original prompt and the context provided by the chunks to generate a response.
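To make this concrete, here is a minimal sketch of conventional RAG retrieval. The `embed` function is a stand-in for a real embedding model, the chunking is naive fixed-size splitting, and nearest neighbors are found with a brute-force cosine-similarity search; a production system would use an actual embedding model and a vector database instead.

```python
# Minimal sketch of conventional RAG retrieval (illustrative, not production code).
import hashlib
from typing import List
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding: a real system would call an embedding model here.
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def chunk(document: str, chunk_size: int = 500) -> List[str]:
    # Naive fixed-size chunking; real systems split on semantic boundaries.
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def retrieve(query: str, documents: List[str], k: int = 3) -> List[str]:
    chunks = [c for doc in documents for c in chunk(doc)]
    chunk_vecs = np.stack([embed(c) for c in chunks])
    query_vec = embed(query)
    # Cosine similarity reduces to a dot product because vectors are normalized.
    scores = chunk_vecs @ query_vec
    top_k = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_k]

def augment_prompt(query: str, documents: List[str]) -> str:
    context = "\n\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"
```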
Let's imagine a scenario involving not just a handful of documents, but billions. Operating at this scale, we encounter a few challenges:
Incomplete Content Representation: Splitting a document into chunks for retrieval and augmentation means each chunk carries only a minimal representation of the original document's content. This leads to lost context and important information being missed during retrieval, since no single chunk carries a comprehensive understanding of the original document. In other words, you may not find the chunk you need because each chunk of text is taken out of the context of the document it came from.
Inaccurate Chunk Similarity Search: As the volume of data grows, so does the amount of noise in each retrieval. Queries are matched far more often to incorrect chunks that merely happen to lie nearby in the embedding space. At scale, this noise makes the entire retrieval system fragile and unreliable.
For these and other reasons, a traditional RAG approach over this volume of data performs poorly and retrieves a great deal of irrelevant context, making it infeasible at scale.
At Arcus, our goal is to find and integrate, for every AI application, the right data that improves its performance. To do this effectively, we manage large volumes of data on our platform. Given the need to search through this volume of data to discover the most relevant context for a given LLM application, the conventional implementation of RAG falls short for the reasons above. To address this, we architected a multi-tiered approach to retrieval.
The simplest way to conceptualize multi-tiered retrieval is to think of a basic two-tiered example. In a two-tiered RAG setting, we search for relevant chunks to augment our LLM in two stages. We represent our first tier as the set of documents in our corpus, and our second tier as the set of chunks within each document. Each document is represented as a summary vector that is computed from the vectors that represent the underlying chunks in the document.
When doing retrieval, we first determine the subset of documents that are relevant to our query (tier 1), and then perform a second search over the chunks from that subset of documents (tier 2). We ultimately return the top K most relevant chunks from the final embedding search. The two-tiered approach filters out much more irrelevant information than conventional RAG by ensuring we only search over chunks from the documents most relevant to our query.
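As an illustration, the sketch below (reusing `embed` and `chunk` from the earlier snippet) shows one way two-tiered retrieval could work. Representing each document by the mean of its chunk vectors is a simplifying assumption, and `n_docs` and `k` are illustrative parameters.

```python
# Sketch of two-tiered retrieval: documents first, then chunks within them.
import numpy as np

def two_tier_retrieve(query, documents, n_docs=10, k=3):
    query_vec = embed(query)

    # Pre-compute chunk vectors and a summary vector per document.
    doc_chunks = [chunk(doc) for doc in documents]
    doc_chunk_vecs = [np.stack([embed(c) for c in cs]) for cs in doc_chunks]
    doc_summaries = np.stack([vecs.mean(axis=0) for vecs in doc_chunk_vecs])

    # Tier 1: keep only the documents whose summary vectors are closest to the query.
    doc_scores = doc_summaries @ query_vec
    top_docs = np.argsort(doc_scores)[::-1][:n_docs]

    # Tier 2: search chunks only within the retained documents.
    candidates = [(doc_chunk_vecs[d][i] @ query_vec, doc_chunks[d][i])
                  for d in top_docs for i in range(len(doc_chunks[d]))]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in candidates[:k]]
```

The mean of the chunk vectors is just one possible document summary; a dedicated summary embedding or a model-generated abstract could serve the same role.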
Now, instead of just two tiers, imagine having many tiers, with each tier further refining our search space to a narrower set of data that is most relevant to our specific query. This allows us to further structure our retrieval over the documents. The diagram below shows an example four-tier system where we reduce our document search space in the first two tiers before searching through the remaining documents and then chunks.
One way to implement a multi-tiered search system is by extracting and sharding on semantic information. This semantic context represents the actual meaning of the text in the underlying documents and chunks, e.g. “this is an article about the US Open” or “this document details Tesla’s financial risk indicators.” We can search for relevant context using this semantic information in a hierarchical fashion, in an approach we call semantic tiering. Each higher semantic tier can be represented by a summary vector or embedding that describes its semantic category.
With semantic tiering, we iteratively refine our search space to only the data we know is semantically relevant to the initial query. We first extract semantic information from the documents and chunks, and then organize a hierarchical structure that groups together documents similar in meaning. In a three-tier system, the first tier represents clusters or groups of documents: we use it to narrow the search space to the topics and categories the documents fall into, the second tier to search over the individual documents within the relevant topics or categories, and the final tier to search over the individual chunks of those documents. In a four-tier system, the first tier we search over would be clusters of clusters of documents (groups of topics or categories), and so on. This semantic tiering process can be extended to whatever depth the scale and diversity of your data requires.
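As a rough illustration, the sketch below (again reusing `embed` and `chunk`) builds a three-tier hierarchy by clustering document summary vectors with plain k-means, then searches topic clusters, then documents, then chunks. The k-means clustering and mean-vector summaries are simplifying assumptions; in practice the semantic grouping could come from extracted topics, metadata, or an LLM.

```python
# Sketch of building and querying a three-tier semantic hierarchy.
import numpy as np
from sklearn.cluster import KMeans

def build_hierarchy(documents, n_topics=8):
    doc_chunks = [chunk(doc) for doc in documents]
    doc_chunk_vecs = [np.stack([embed(c) for c in cs]) for cs in doc_chunks]
    doc_summaries = np.stack([v.mean(axis=0) for v in doc_chunk_vecs])

    # Group documents into topic clusters and summarize each cluster.
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(doc_summaries)
    topic_summaries = np.stack([doc_summaries[labels == t].mean(axis=0)
                                for t in range(n_topics)])
    return doc_chunks, doc_chunk_vecs, doc_summaries, labels, topic_summaries

def tiered_retrieve(query, hierarchy, top_topics=2, n_docs=10, k=3):
    doc_chunks, doc_chunk_vecs, doc_summaries, labels, topic_summaries = hierarchy
    q = embed(query)

    # Tier 1: pick the most relevant topic clusters.
    topics = np.argsort(topic_summaries @ q)[::-1][:top_topics]
    doc_ids = [d for d in range(len(doc_summaries)) if labels[d] in topics]

    # Tier 2: pick the most relevant documents within those clusters.
    doc_ids.sort(key=lambda d: doc_summaries[d] @ q, reverse=True)
    doc_ids = doc_ids[:n_docs]

    # Tier 3: search chunks only within the retained documents.
    candidates = [(doc_chunk_vecs[d][i] @ q, doc_chunks[d][i])
                  for d in doc_ids for i in range(len(doc_chunks[d]))]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in candidates[:k]]
```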
One of the most difficult challenges in designing the multi-tiered system is determining what semantic information to extract, how to extract it, and how to most effectively cluster the documents to narrow the search space. The quality of your clustering and semantic context extraction determines the quality of your search and retrieval. By intelligently choosing how and what you group together, you ensure the retrieval process uses the right context to narrow the search space and retrieves the globally most relevant information.
Let's walk through an example:
Imagine we have a corpus consisting of all historical SEC 10-Q and 10-K filings (financial documents that public companies are required to submit to the Securities and Exchange Commission on a quarterly and yearly basis, respectively) and want to use this corpus for our RAG application. Here is one way we could implement a four-tier retrieval system for RAG over the SEC 10-Q and 10-K filings.
Let’s set up our tiers as the following:
Tier 1 - Industries
Tier 2 - Companies
Tier 3 - Individual SEC 10-Q and 10-K Filings
Tier 4 - Chunks of Filings
Let’s walk through how retrieval would work for a simple example query:
Query: “What were Microsoft’s revenue changes from 2001 to 2019? Were there any times where their revenue went down? What about Apple?”
Tier 1 – Industries
In Tier 1, the retrieval filters our search space over industries, in this case narrowing to the technology sector.
Tier 2 – Companies
In Tier 2, we determine which companies within the sector are most relevant to the query: Microsoft and Apple.
Tier 3 – Individual SEC 10-Q and 10-K Filings
In Tier 3, we search over the filings from Microsoft and Apple, narrowing down to the 10-Q and 10-K filings from 2001 to 2019.
Tier 4 – Chunks of Filings
In Tier 4, we search through the chunks within our remaining set of documents to find the ones most relevant to the query. In this case, we retrieve the chunks that discuss revenue or revenue changes.
Finally, these chunks are returned to augment the LLM's prompt alongside the original query.
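A hypothetical sketch of this flow is shown below. The tier data structures, summary vectors, and the `search` helper are illustrative assumptions rather than an actual Arcus API, and the cut-offs at each tier are arbitrary; year-range filtering (2001 to 2019) would typically be handled with metadata rather than embeddings and is omitted here.

```python
# Hypothetical four-tier retrieval over SEC filings, reusing `embed` from earlier.
def search(query_vec, items, vector_of, top_n):
    # Generic tier search: score each item by its summary vector, keep the top_n.
    scored = sorted(items, key=lambda item: vector_of(item) @ query_vec, reverse=True)
    return scored[:top_n]

def four_tier_retrieve(query, industries, k=5):
    q = embed(query)

    # Tier 1: industries -> e.g. narrows to the technology sector.
    sectors = search(q, industries, lambda ind: ind["vector"], top_n=1)
    # Tier 2: companies within those industries -> Microsoft and Apple.
    companies = search(q, [c for ind in sectors for c in ind["companies"]],
                       lambda c: c["vector"], top_n=2)
    # Tier 3: individual 10-Q / 10-K filings for those companies.
    filings = search(q, [f for c in companies for f in c["filings"]],
                     lambda f: f["vector"], top_n=80)
    # Tier 4: chunks within those filings, returned to augment the prompt.
    chunks = search(q, [ch for f in filings for ch in f["chunks"]],
                    lambda ch: ch["vector"], top_n=k)
    return [ch["text"] for ch in chunks]
```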
The example above illustrates a simple case of just SEC 10-Q and 10-K filings, but the tiering structure becomes significantly more complex as the scale of documents increases.
While we're encouraged by the progress we've made in architecting RAG for planet-scale data retrieval, we still have more work ahead of us, and we're continuing to innovate on and refine this approach.
We are hiring across the board! We’re working on exciting problems at the cutting-edge of Retrieval Augmented Generation (RAG), LLMs, Model Evals, Data Infrastructure, Data Processing and more. Check out our careers page or reach out to us at [email protected]!