RAG Systems: Enhancing LLMs with Custom Data
Retrieval-Augmented Generation (RAG) is transforming how we work with Large Language Models by combining their reasoning capabilities with custom, up-to-date information from your own data sources.
Understanding RAG
RAG enhances LLMs by:
- Providing access to custom knowledge bases
- Reducing hallucinations with factual grounding
- Enabling real-time information updates
- Maintaining data privacy and control
Architecture Overview
A typical RAG system consists of four stages (a minimal sketch of the whole pipeline follows this list):
- Document Processing: Chunking and embedding
- Vector Store: Similarity search database
- Retrieval: Finding relevant context
- Generation: LLM produces answer with context
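Before picking a concrete stack, it helps to see the four stages as one pipeline. The interface below is purely illustrative — the names `RagPipeline`, `Chunk`, and `ScoredChunk` are hypothetical and not tied to any particular library; the concrete LangChain/Pinecone implementation follows in the next section.

```typescript
// Illustrative shape of a RAG pipeline; all names here are hypothetical,
// not a specific library's API.
interface Chunk {
  id: string;
  text: string;
  metadata: Record<string, string>;
}

interface ScoredChunk extends Chunk {
  score: number; // similarity score returned by the vector store
}

interface RagPipeline {
  // Document Processing: split raw documents into chunks and embed them
  process(rawDocuments: string[]): Promise<Chunk[]>;
  // Vector Store: persist embedded chunks for similarity search
  store(chunks: Chunk[]): Promise<void>;
  // Retrieval: find the k most relevant chunks for a query
  retrieve(query: string, k: number): Promise<ScoredChunk[]>;
  // Generation: have the LLM answer the query using the retrieved context
  generate(query: string, context: ScoredChunk[]): Promise<string>;
}
```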
Building a RAG System
Step 1: Document Processing
```typescript
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

// documentText is the raw text of the document you want to index
const chunks = await splitter.createDocuments([documentText]);
```
Step 2: Create Embeddings
```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
});
```
Step 3: Store in Vector Database
```typescript
import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment
const index = pinecone.index("knowledge-base");

// Embed all chunks in one batch, then upsert each vector with its source text as metadata
const vectors = await embeddings.embedDocuments(chunks.map((c) => c.pageContent));

await index.upsert(
  chunks.map((chunk, i) => ({
    id: `doc-${i}`,
    values: vectors[i],
    metadata: { text: chunk.pageContent },
  }))
);
```
Step 4: Retrieval Chain
```typescript
import { RetrievalQAChain } from "langchain/chains";
import { ChatOpenAI } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";

const llm = new ChatOpenAI({ modelName: "gpt-4" });

// Wrap the existing Pinecone index as a LangChain vector store
const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex: index,
});

const chain = RetrievalQAChain.fromLLM(llm, vectorStore.asRetriever());

const response = await chain.call({
  query: "What is the company's refund policy?",
});

console.log(response.text); // the generated answer
```
Advanced Techniques
Hybrid Search
Combine semantic and keyword search:
```typescript
// Semantic (vector) search plus keyword search, merged and re-ranked.
// fullTextSearch and rerank here stand in for your own keyword search
// (e.g. BM25 or database full-text search) and merging logic.
const results = await vectorStore.similaritySearch(query, 5);
const keywordResults = await fullTextSearch(query);
const combined = rerank([...results, ...keywordResults]);
```
Re-ranking
Improve result relevance:
```typescript
import { CohereRerank } from "@langchain/cohere";

// Scores each retrieved document against the query using Cohere's rerank endpoint.
// Supply your Cohere API key via the apiKey option or environment; the model name is one example.
const reranker = new CohereRerank({ model: "rerank-english-v3.0" });

// Returns { index, relevanceScore } pairs referring back to `results`
const reranked = await reranker.rerank(results, query);
```
Multi-Query Retrieval
Generate multiple perspectives:
```typescript
// Ask the LLM for paraphrases of the question, then retrieve for each one
const rewrite = await llm.invoke(
  `Generate 3 different versions of this question, one per line:\n${query}`
);
const queries = String(rewrite.content)
  .split("\n")
  .map((q) => q.trim())
  .filter(Boolean);

const allResults = await Promise.all(
  queries.map((q) => vectorStore.similaritySearch(q, 5))
);
```
Best Practices
- Chunk Size: 500-1000 tokens works well
- Overlap: 10-20% overlap between chunks
- Metadata: Store source, date, and author with each chunk (see the sketch after this list)
- Hybrid Search: Combine semantic + keyword
- Reranking: Use cross-encoder models
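As a rough illustration of the first three points, here is how those defaults might look with the splitter from Step 1. The chunk values are examples rather than universal recommendations, `chunkSize`/`chunkOverlap` are measured in characters here rather than tokens, and the file name, date, and author are made up for the example.

```typescript
// Illustrative defaults: ~800-character chunks with ~15% overlap,
// plus source metadata attached to every chunk.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 120,
});

const docs = await splitter.createDocuments(
  [documentText],
  // metadata is copied onto each chunk produced from the corresponding text
  [{ source: "employee-handbook.pdf", date: "2024-01-15", author: "HR Team" }]
);
```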
Production Considerations
Caching
Cache full responses so repeated queries skip retrieval and generation entirely. A minimal sketch, assuming ioredis as the Redis client:

```typescript
import Redis from "ioredis"; // assuming ioredis as the Redis client

const cache = new Redis();

async function getCachedResponse(query: string) {
  const cached = await cache.get(query);
  if (cached) return JSON.parse(cached);

  const response = await chain.call({ query });

  // Cache the serialized response for one hour
  await cache.setex(query, 3600, JSON.stringify(response));
  return response;
}
```
Monitoring
Track retrieval quality:
```typescript
import type { Document } from "@langchain/core/documents";

// `analytics` stands in for whatever event-tracking client you use.
// `results` are [document, score] pairs, e.g. from similaritySearchWithScore().
async function logRetrieval(query: string, results: [Document, number][]) {
  const scores = results.map(([, score]) => score);
  await analytics.track({
    event: "rag_retrieval",
    query,
    numResults: results.length,
    avgScore: scores.reduce((a, b) => a + b, 0) / results.length,
  });
}
```
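A minimal usage sketch, assuming the vectorStore from Step 4:

```typescript
// Retrieve with similarity scores, then log the retrieval alongside answering the query
const scoredDocs = await vectorStore.similaritySearchWithScore(query, 5);
await logRetrieval(query, scoredDocs);
```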
Conclusion
RAG systems unlock the full potential of LLMs by grounding them in your custom data. Start with a simple implementation and iterate based on user feedback and retrieval metrics.