If you've been following the world of AI, you've likely heard of Retrieval Augmented Generation (RAG). It's a fantastic framework that helps Large Language Models (LLMs) provide more accurate and up-to-date information by retrieving relevant documents before generating a response. However, the initial, or "vanilla," RAG setup often falls short in addressing crucial aspects that impact the quality of both retrieval and answer generation. This is where advanced RAG comes in, offering powerful optimisations to take your AI applications to the next level.
Vanilla RAG Challenges
- ❌ Irrelevant retrieved documents
- ❌ Insufficient context
- ❌ Redundant information noise
- ❌ High latency
- ❌ Failed answer generation
Advanced RAG Solutions
- ✅ Smart pre-retrieval optimisation
- ✅ Enhanced embedding models
- ✅ Intelligent filtering & ranking
- ✅ Optimised performance
- ✅ Reliable answer generation
Advanced RAG focuses on optimising the RAG pipeline at three distinct stages:
1. Pre-retrieval
Data preparation & query optimisation
2. Retrieval
Enhanced search & filtering
3. Post-retrieval
Refinement & answer generation
1. Pre-retrieval: Setting the Stage for Success
This initial stage is all about preparing your data and refining user queries before the actual retrieval happens. It's a crucial step that can significantly impact the relevance of what's ultimately retrieved.
Data Indexing Optimisations
This involves structuring and preprocessing your data within the RAG ingestion pipeline, typically within the cleaning or chunking modules, for better indexing.
- Sliding Window: This technique introduces overlap between text chunks, ensuring that important context near chunk boundaries is retained, which boosts retrieval accuracy. It's especially useful in fields like legal or medical documents where information often spans sections.
- Enhancing Data Granularity: This is about making your dataset cleaner and more accurate by removing irrelevant details, verifying facts, and updating outdated information, leading to sharper retrieval.
- Metadata: Adding tags like dates, URLs, or chapter markers can help filter results efficiently during retrieval.
- Optimising Index Structures: This includes using various chunk sizes and multi-indexing strategies.
- Small-to-Big: An innovative algorithm that decouples the chunks used for retrieval from the context used for final answer generation. It uses a small text sequence for embedding (reducing noise and multiple topics in the embedding) while preserving a wider window of context in the metadata for the LLM.
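The Small-to-Big idea can be sketched as follows. This is a minimal illustration using in-memory dicts rather than a real vector store, and the sentence-level splitting is a simplified placeholder for a proper parser:

```python
# Small-to-Big sketch: embed small sentences for retrieval, but keep the
# larger parent chunk in metadata so the LLM receives the wider context.
# Minimal illustration only -- a real system would use a vector store.

def build_small_to_big_index(parent_chunks):
    """Split each parent chunk into sentences; each sentence record keeps a
    reference to its parent so retrieval can return the full context."""
    records = []
    for parent_id, parent_text in enumerate(parent_chunks):
        for sentence in parent_text.split(". "):
            if sentence.strip():
                records.append({
                    "embed_text": sentence.strip(),  # small unit -> embedded
                    "metadata": {
                        "parent_id": parent_id,
                        "parent_text": parent_text,  # big unit -> LLM context
                    },
                })
    return records

parents = [
    "RAG retrieves documents before generation. It grounds the LLM in facts.",
    "Chunk overlap preserves context. It helps at section boundaries.",
]
index = build_small_to_big_index(parents)

# A match on a single sentence still hands the whole parent chunk to the LLM:
hit = index[0]
print(hit["embed_text"])
print(hit["metadata"]["parent_text"])
```

The key design point is the decoupling: the short `embed_text` keeps the embedding focused on a single topic, while `parent_text` travels along in metadata for answer generation.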
Sliding Window Implementation
```python
# Example: sliding-window chunking
def create_sliding_chunks(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # overlap with the previous chunk
    return chunks

# Usage
document = "Your long document text here..."
chunks = create_sliding_chunks(document, chunk_size=1000, overlap=200)
print(f"Created {len(chunks)} chunks with overlap")
```
Query Optimisation
These techniques are applied directly to the user's query before it's embedded and used to retrieve chunks.
- Query Routing: This intelligent technique decides which action to take based on the user's input, much like an if/else statement, but using natural language. For instance, it can determine whether context should come from a vector database, an SQL database, or even the internet via API calls. It can also select the best prompt template for a given input.
- Query Rewriting: Sometimes a user's initial query doesn't align well with how your data is structured. Query rewriting reformulates the question to better match the indexed information. This can involve paraphrasing (e.g., "causes of climate change" to "factors contributing to global warming"), synonym substitution, breaking longer queries into sub-queries, or Hypothetical Document Embeddings (HyDE), where an LLM generates a hypothetical response that is then fed together with the original query into the retrieval stage.
- Query Expansion: This enriches the user's question by adding additional terms or concepts, providing different perspectives for the search (e.g., searching "disease" and also including "illnesses" or "ailments").
- Self-query: This maps unstructured queries into structured ones, where an LLM identifies key entities (like cities) within the input text to use as filtering parameters, thereby reducing the vector search space.
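Query expansion can be as simple as appending known synonyms to the query before the search. The synonym table below is a hypothetical stand-in for a thesaurus lookup or an LLM-generated expansion:

```python
# Naive query expansion: enrich the query with synonyms so the search
# covers alternative phrasings. SYNONYMS is a hypothetical stand-in for
# a thesaurus lookup or an LLM-generated expansion.
SYNONYMS = {
    "disease": ["illness", "ailment"],
    "car": ["automobile", "vehicle"],
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return " ".join(expanded)

print(expand_query("rare disease treatment"))
# -> "rare disease treatment illness ailment"
```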
It's worth noting that pre-retrieval optimisations are highly dependent on your specific data type, structure, and source, making experimentation key to finding what works best.
2. Retrieval: Enhancing the Search for Information
The retrieval step itself can be made much more effective by focusing on two core areas: improving embedding models and leveraging database features. The ultimate goal here is to enhance the vector search step by improving semantic similarity between the query and indexed data.
Improving Embedding Models
- You can fine-tune pre-trained embedding models to better understand the specific jargon and nuances of your domain, especially for evolving or rare terminology.
- Alternatively, using instructor models can guide the embedding generation process with specific instructions tailored to your domain. This can be a good option as it's less resource-intensive than full fine-tuning.
Leveraging Database Filter and Search Features
- Hybrid Search: This combines vector search with keyword-based search. While vector search excels at semantic similarity, keyword search is superb for pinpoint accuracy with exact matches. Blending them gives you the best of both worlds, with the balance typically controlled by a parameter called alpha.
- Filtered Vector Search: This uses metadata indexes to filter for specific keywords either before or after the vector search, narrowing down the search space.
Hybrid Search Architecture
- 🔤 Keyword Search: exact matches and precision — finds exact phrase matches.
- 🧠 Vector Search: semantic similarity — finds related concepts such as "AI", "neural networks", "deep learning".
- Combined: the best of both worlds.
In practice, starting with filtered vector search or hybrid search is common due to their quicker implementation, allowing you to adjust your strategy and fine-tune your embedding model if needed.
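The alpha-weighted blend can be sketched as follows. The keyword and vector scores here are toy values assumed to be pre-normalised to [0, 1]; a real system would normalise BM25 and cosine-similarity scores before mixing them:

```python
# Hybrid search sketch: blend a keyword score and a vector-similarity score
# with a single weight alpha (alpha=1.0 -> pure vector, alpha=0.0 -> pure
# keyword). Scores are assumed pre-normalised to [0, 1]; real systems must
# normalise BM25 and cosine similarity before mixing.

def hybrid_score(keyword_score, vector_score, alpha=0.5):
    return alpha * vector_score + (1 - alpha) * keyword_score

# Toy candidate documents with pre-computed, normalised scores:
candidates = [
    {"doc": "exact phrase match",   "keyword": 0.9, "vector": 0.4},
    {"doc": "semantically related", "keyword": 0.1, "vector": 0.8},
]

for alpha in (0.0, 0.5, 1.0):
    ranked = sorted(
        candidates,
        key=lambda c: hybrid_score(c["keyword"], c["vector"], alpha),
        reverse=True,
    )
    print(alpha, "->", ranked[0]["doc"])
```

Sweeping alpha shows the trade-off: at 0.0 the exact keyword match wins, while at 1.0 the semantically related document ranks first.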
3. Post-retrieval: Refining What's Been Found
Once the data has been retrieved, post-retrieval optimisations ensure that the LLM's performance isn't compromised by issues like limited context windows or noisy, irrelevant information.
- Prompt Compression: This method aims to eliminate unnecessary details from the retrieved data while preserving its core essence, making the prompt more concise for the LLM.
- Re-ranking: A powerful technique where a cross-encoder ML model is used to assign a matching score between the user's input and each retrieved chunk. The retrieved items are then sorted by this score, and only the top N most relevant results are kept. This is computationally more intensive than initial similarity search but highly effective at identifying complex relationships, thus it's applied as a refinement step after the initial retrieval.
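Re-ranking can be sketched as below. The token-overlap scorer is a deliberately simple stand-in for a real cross-encoder model, which would score each (query, chunk) pair jointly with a neural network:

```python
# Re-ranking sketch: score each (query, chunk) pair, sort by score, keep the
# top N. The token-overlap scorer is a deliberately simple stand-in for a
# real cross-encoder model.

def score_pair(query, chunk):
    """Fraction of query tokens that appear in the chunk (toy scorer)."""
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

def rerank(query, chunks, top_n=2):
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

retrieved = [
    "Cats are popular pets.",
    "Vector search retrieves chunks by semantic similarity.",
    "Re-ranking sorts retrieved chunks by relevance to the query.",
]
top = rerank("how does re-ranking sort retrieved chunks", retrieved, top_n=2)
print(top)
```

The overall shape matches the technique described above: an initial retrieval produces candidates cheaply, then a more expensive pairwise scorer reorders them and only the top N survive into the prompt.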
The Bottom Line
Advanced RAG isn't just a buzzword; it's a suite of powerful techniques designed to significantly enhance the RAG algorithm across its three key stages. By strategically preprocessing data, intelligently adjusting user queries, refining embedding models, using smart filtering, and cleaning up retrieved information, you can build a RAG workflow that delivers more accurate, relevant, and efficient responses from your LLMs. Remember, the best approach will always depend on your specific data and use case, so embrace experimentation!
🎯 Key Takeaways
Pre-retrieval
Optimize data & queries before search
Retrieval
Enhance embedding models & search
Post-retrieval
Refine & compress results