🤖 Introduction: The Problem With Vanilla LLMs
Large language models like GPT-4, Claude, and Gemini are impressive. They can write code, explain concepts, and reason about problems. But ask them about your company's internal docs, your product database, or yesterday's meeting notes, and they'll confidently make things up.
That's not because the models are bad. It's because they don't have your data. They were trained on public internet text, not your specific knowledge base.
RAG fixes this. It's a pattern that retrieves relevant information from your own documents and feeds it to the LLM as context, so the model answers based on actual facts instead of guessing.
I've been building RAG systems for internal tools at Noisiv Consulting, and this article walks through how it works in practice — not the research paper version, the "I need to ship this" version.
🧩 How RAG Works (The Simple Version)
The core idea is straightforward:
User asks a question
↓
Search your documents for relevant chunks
↓
Combine the question + relevant chunks into a prompt
↓
Send to LLM
↓
LLM answers based on your actual data
That's it. The magic is in how well you do each step.
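In code, the whole pipeline is just a handful of calls. Here's a rough sketch with placeholder helper names; the real implementations are built step by step below:

// High-level sketch. The helper names are placeholders;
// the actual functions are written in Steps 1-4 of this article.
async function answerQuestion(question) {
  const queryVector = await getEmbedding(question);     // embed the question
  const chunks = await searchVectorDB(queryVector, 5);  // retrieve top-K relevant chunks
  const prompt = buildRAGPrompt(question, chunks);      // combine question + context
  return askLLM(prompt);                                // generate the answer
}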
The Architecture
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  User Query  │  →  │   Embedder   │  →  │  Vector DB   │
│  "How do I   │     │  (OpenAI /   │     │  (Pinecone / │
│   deploy?"   │     │   Cohere)    │     │   ChromaDB)  │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                      Top K relevant chunks
                                                 │
┌──────────────┐     ┌──────────────┐     ┌──────┴───────┐
│   Response   │  ←  │     LLM      │  ←  │    Prompt    │
│  "To deploy, │     │   (GPT-4 /   │     │   Builder    │
│   run..."    │     │   Claude)    │     │              │
└──────────────┘     └──────────────┘     └──────────────┘
📄 Step 1: Prepare Your Documents
Before anything, you need to get your data into a format the system can work with. This means:
- Collect your documents (PDFs, Markdown, HTML, database records)
- Clean them (remove headers, footers, navigation elements)
- Chunk them into smaller pieces
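As a concrete starting point, here's a minimal loader for a folder of Markdown files. The docs/ folder name and the cleanup rules are just assumptions for the sketch; PDFs and HTML need a real parser, which I'm skipping here:

import fs from "node:fs/promises";
import path from "node:path";

// Load every .md file from a folder and apply light cleanup
async function loadMarkdownDocs(dir = "docs") {
  const docs = [];
  for (const file of await fs.readdir(dir)) {
    if (!file.endsWith(".md")) continue;
    const raw = await fs.readFile(path.join(dir, file), "utf8");
    const cleaned = raw
      .replace(/^---[\s\S]*?---\n/, "") // strip YAML frontmatter if present
      .replace(/<[^>]+>/g, "")          // strip stray HTML tags
      .trim();
    docs.push({ source: file, content: cleaned });
  }
  return docs;
}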
Chunking Strategy
Chunking is where most people get it wrong. Too small and you lose context. Too large and you waste token budget with irrelevant information.
// Simple but effective chunking
function chunkDocument(text, options = {}) {
  const {
    chunkSize = 500, // characters per chunk
    overlap = 50,    // overlap between chunks
  } = options;

  const chunks = [];
  let start = 0;

  while (start < text.length) {
    let end = start + chunkSize;

    // Don't cut in the middle of a sentence
    if (end < text.length) {
      const lastPeriod = text.lastIndexOf(".", end);
      if (lastPeriod > start + chunkSize * 0.5) {
        end = lastPeriod + 1;
      }
    }

    chunks.push({
      content: text.slice(start, end).trim(),
      startIndex: start,
      endIndex: end,
    });

    start = end - overlap;
  }

  return chunks;
}
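Calling it is a one-liner. The options below are just the defaults from above; tune them for your content:

const chunks = chunkDocument(doc.content, { chunkSize: 500, overlap: 50 });
console.log(`Split "${doc.source}" into ${chunks.length} chunks`);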
Chunking Rules of Thumb
- Start around 500 characters per chunk with ~50 characters of overlap (the defaults above) and tune from there
- Respect sentence and paragraph boundaries instead of cutting mid-thought
- Too small loses context; too large wastes token budget on irrelevant text
🧮 Step 2: Generate Embeddings
Embeddings convert text into numerical vectors that capture meaning. Similar texts produce similar vectors, which is how we find relevant chunks later.
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small", // Fast, cheap, good enough
    input: text,
  });
  return response.data[0].embedding; // Array of 1536 floats
}

// Embed all chunks
async function embedChunks(chunks) {
  const embedded = [];
  for (const chunk of chunks) {
    const vector = await getEmbedding(chunk.content);
    embedded.push({
      ...chunk,
      vector,
    });
  }
  return embedded;
}
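One note on the loop above: it makes one API call per chunk, which gets slow for large document sets. The embeddings endpoint also accepts an array of inputs, so you can batch. A sketch, with a batch size of 100 as an arbitrary choice (check your provider's limits):

// Batched version: one API call per batch instead of per chunk
async function embedChunksBatched(chunks, batchSize = 100) {
  const embedded = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch.map((c) => c.content), // array input = many embeddings per call
    });
    for (const item of response.data) {
      embedded.push({ ...batch[item.index], vector: item.embedding });
    }
  }
  return embedded;
}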
Which Embedding Model?
For most use cases, text-embedding-3-small is the best starting point. It's fast, cheap, and good enough.
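If you want to see "similar texts produce similar vectors" for yourself, cosine similarity is the standard measure: two chunks about the same topic should score noticeably higher than unrelated ones. The vector databases in the next step do this for you, so this is just for intuition:

// Cosine similarity: ~1 = very similar, ~0 = unrelated
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// const sim = cosineSimilarity(await getEmbedding("How do I deploy?"), chunk.vector);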
🗄️ Step 3: Store in a Vector Database
Vector databases are optimized for similarity search — finding the vectors closest to a query vector.
Using ChromaDB (Simple, Local)
import { ChromaClient } from "chromadb";

const chroma = new ChromaClient();
const collection = await chroma.createCollection({ name: "docs" });

// Store chunks with embeddings
await collection.add({
  ids: chunks.map((_, i) => `chunk-${i}`),
  documents: chunks.map((c) => c.content),
  embeddings: chunks.map((c) => c.vector),
  metadatas: chunks.map((c) => ({
    source: c.source,
    section: c.section,
  })),
});

// Query: find relevant chunks
const results = await collection.query({
  queryEmbeddings: [queryVector],
  nResults: 5,
});
Using Pinecone (Production-Scale)
import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index("docs");

// Upsert chunks
await index.upsert(
  chunks.map((chunk, i) => ({
    id: `chunk-${i}`,
    values: chunk.vector,
    metadata: {
      content: chunk.content,
      source: chunk.source,
    },
  }))
);

// Query
const results = await index.query({
  vector: queryVector,
  topK: 5,
  includeMetadata: true,
});
Which Vector DB?
ChromaDB is great for local prototyping, Pinecone makes sense when you want a managed service at production scale, and if you're already using PostgreSQL, pgvector is the easiest path. No new infrastructure to manage.
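The examples in this article use ChromaDB, but if you go the pgvector route, a query looks roughly like the sketch below. The chunks table, its columns, and the vector(1536) type are my own assumptions; adapt them to your schema:

import pg from "pg";

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// <=> is pgvector's cosine distance operator (lower = more similar)
async function queryChunks(queryVector, topK = 5) {
  const { rows } = await pool.query(
    `SELECT content, source, embedding <=> $1 AS distance
       FROM chunks
      ORDER BY embedding <=> $1
      LIMIT $2`,
    [`[${queryVector.join(",")}]`, topK]
  );
  return rows;
}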
💬 Step 4: Build the Prompt
This is where you combine the user's question with the retrieved context and send it to the LLM.
function buildRAGPrompt(question, relevantChunks) {
  const context = relevantChunks
    .map((chunk, i) => `[Source ${i + 1}]: ${chunk.content}`)
    .join("\n\n");

  return `You are a helpful assistant that answers questions based on the provided context.
If the context doesn't contain enough information to answer the question,
say "I don't have enough information to answer that."
Do NOT make up information. Only use what's in the context below.

---
Context:
${context}
---

Question: ${question}

Answer:`;
}
The Complete RAG Function
async function askRAG(question) {
  // 1. Embed the question
  const queryVector = await getEmbedding(question);

  // 2. Find relevant chunks
  const results = await collection.query({
    queryEmbeddings: [queryVector],
    nResults: 5,
  });

  const relevantChunks = results.documents[0].map((doc, i) => ({
    content: doc,
    // Note: ChromaDB returns distances, not similarities (lower means closer)
    score: results.distances[0][i],
  }));

  // 3. Build prompt with context
  const prompt = buildRAGPrompt(question, relevantChunks);

  // 4. Get LLM response
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    temperature: 0.3, // Lower = more factual, less creative
    max_tokens: 1000,
  });

  return {
    answer: response.choices[0].message.content,
    sources: relevantChunks.map((c) => c.content.slice(0, 100) + "..."),
  };
}
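Using it is then a single call (the question string is just an example):

const { answer, sources } = await askRAG("How do I deploy to staging?");
console.log(answer);
console.log("Sources:", sources);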
🛡️ Step 5: Reduce Hallucinations
RAG doesn't eliminate hallucinations, but you can minimize them:
Practical Techniques
- Lower temperature (0.1–0.3) for factual Q&A
- Explicit instructions in the system prompt: "Only answer based on the provided context"
- Return sources alongside answers so users can verify
- Set confidence thresholds: if retrieved chunks have low similarity scores, say "I'm not sure"
- Chunk quality matters more than quantity: 3 highly relevant chunks beat 10 vaguely related ones
// Confidence check before answering.
// This assumes `score` is a similarity (higher = more relevant). If your
// vector DB returns a distance instead (ChromaDB's query above does),
// convert it to a similarity or flip the comparison.
const MIN_SIMILARITY = 0.75;

const confidentChunks = relevantChunks.filter(
  (r) => r.score >= MIN_SIMILARITY
);

if (confidentChunks.length === 0) {
  return {
    answer: "I don't have enough relevant information to answer this question confidently.",
    sources: [],
  };
}
🏗️ Production Considerations
Keep Your Index Fresh
Documents change. Your RAG pipeline needs to handle updates:
// Watch for document changes and re-index
async function reindexDocument(docId) {
  // Delete this document's old chunks, matched via metadata
  await collection.delete({ where: { docId } });

  // Re-chunk and re-embed
  const doc = await fetchDocument(docId); // your own document loader
  const chunks = chunkDocument(doc.content);
  const embedded = await embedChunks(chunks);

  // Insert updated chunks, tagging each with its docId so the delete above
  // can find them on the next re-index
  await collection.add({
    ids: embedded.map((_, i) => `${docId}-${i}`),
    documents: embedded.map((c) => c.content),
    embeddings: embedded.map((c) => c.vector),
    metadatas: embedded.map(() => ({ docId })),
  });
}
Cost Estimation
For a knowledge base with 10,000 documents (~5M tokens total) serving 1,000 queries/day, expect roughly $100-200/month in total for embeddings, vector storage, and LLM calls. Very affordable for the value it provides.
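If you want to estimate this for your own setup, the arithmetic is simple. The sketch below takes prices as inputs rather than hardcoding them, since pricing changes; plug in current numbers from your providers:

// Rough monthly LLM cost estimate. All prices are inputs; check current pricing.
function estimateMonthlyLLMCost({
  queriesPerDay,
  inputTokensPerQuery,   // prompt + retrieved context
  outputTokensPerQuery,  // generated answer
  pricePerMTokenIn,      // $ per 1M input tokens
  pricePerMTokenOut,     // $ per 1M output tokens
}) {
  const queries = queriesPerDay * 30;
  const inputCost = (queries * inputTokensPerQuery / 1e6) * pricePerMTokenIn;
  const outputCost = (queries * outputTokensPerQuery / 1e6) * pricePerMTokenOut;
  return inputCost + outputCost;
}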
🎯 Conclusion
RAG isn't complicated. It's four steps: chunk your docs, embed them, store them, retrieve and prompt. The hard part is doing each step well — good chunking, the right similarity threshold, and a prompt that keeps the model honest.
If you're building any kind of AI assistant, internal search, or knowledge system, RAG is the pattern you want. It turns a generic LLM into something that actually knows your stuff.
Start with ChromaDB and text-embedding-3-small. You can have a working prototype in an afternoon.
📚 Key Takeaways
- RAG = Retrieve + Augment + Generate — ground LLM answers in your actual data
- Chunking quality determines retrieval quality — respect sentence boundaries
- pgvector is the easiest option if you already use PostgreSQL
- Lower temperature and explicit prompts reduce hallucinations significantly
- Return sources alongside answers so users can verify the response
- Start simple, measure retrieval quality, then optimize
