AI & Development · 5 min read
Building Ask Warren: A Production RAG System with 47 Years of Investment Wisdom
How I built a retrieval-augmented generation system that lets you chat with Warren Buffett's shareholder letters, and what it taught me about RAG trade-offs in production.

I’ve been curious about RAG (Retrieval-Augmented Generation) for a while. Reading about a technology and actually shipping it are very different. I wanted to feel the real friction—parsing, chunking, embeddings, latency, cost, quality—and see the upside. I like to think in trade-offs.
I had the perfect excuse.
I look up to Warren Buffett. I have a deep interest in finance, and I treat the Berkshire Hathaway shareholder letters like a personal curriculum—47 years of clear thinking on capital allocation, incentives, risk, and integrity. I’ve read them all. Multiple times. They’re dense with ideas you can actually use.
So I built Ask Warren—a production RAG system that answers questions from those letters, with sources and year citations, fast and cheap.
Why Warren Buffett?
I needed a corpus that’s:
- High-signal (no fluff; consistent voice and quality)
- Publicly available (PDFs from 1977–2023)
- Actually interesting to query (things you’d ask a real investor)
And, personally:
- I admire Buffett’s clarity and discipline.
- The letters are the primary source. No hot takes. No summaries. Just the original text.
- I’ve already internalized a lot of it; now I wanted a tool to interrogate and cross-reference it on demand.
Examples I wanted to explore:
- “What does Warren think about cryptocurrency?”
- “How should I think about market downturns?”
- “What makes a great manager?”
The Vision
Users ask questions in plain English. The system finds relevant passages from 47 years of letters. Gemini generates answers with explicit year/source citations. Cost: essentially free (~$0.000001 per query).
Understanding RAG (in practice)
Traditional LLM:
User: "What does Warren think about Bitcoin?"
LLM: ...makes up something plausible from training data...

RAG:
User: "What does Warren think about Bitcoin?"
↓
1) Embed the question
2) Search 673 chunks from the letters
3) Take top 5 most relevant
4) Inject those chunks as context
5) Generate an answer that cites the years/sources
↓
"Based on his 2018 and 2020 letters, Warren said..."Why this matters: Grounded answers with receipts, not vibes.
Key Architecture Decisions
Decision 1: Bundle Embeddings in Lambda
Options for vector storage:
- Pinecone (managed): ~$70/mo + network latency
- S3 + load at runtime: $0 storage, but S3 latency (~100ms) per cold start
- Bundle with Lambda: $0, lowest latency, simplest
I chose bundled embeddings:
// 673 chunks × 768 dims × 4 bytes ≈ ~2 MB
// Way under Lambda's 250 MB unzipped limit
// No network hop → ~50ms search vs ~150ms with remote calls
// $0 infra for vectors
Trade-off: not for giant corpora. But for this dataset, it's a perfect fit.
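What "bundled" looks like in code: a minimal sketch, assuming the vectors ship as an embeddings.json file at the root of the deployment package (the file name and shape are my assumption):

// Loaded once at module scope, so warm invocations reuse the vectors for free.
// Assumes embeddings.json sits at the root of the deployment package
// (Lambda's working directory is the task root, so a bare path resolves there).
import { readFileSync } from 'node:fs';

interface EmbeddedChunk {
  year: number;        // which letter the chunk came from
  text: string;        // the chunk text
  embedding: number[]; // 768-dim vector from text-embedding-004
}

const embeddingsData: { chunks: EmbeddedChunk[] } =
  JSON.parse(readFileSync('embeddings.json', 'utf-8'));

Because the parse happens at module scope, it runs once per cold start and every warm invocation works against the in-memory array.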
Decision 2: Gemini Over OpenAI (for this use case)
Gemini text-embedding-004
- Free tier (2M tokens/day at time of build)
- 768-dim vectors (smaller = faster, good enough quality)
- Works well for semantic retrieval
OpenAI text-embedding-3-small
- Low cost but not free
- 1536-dim vectors (larger payloads)
- Slightly better in some benchmarks
For this project: Free + fast > marginally better.
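Back-of-the-envelope on what the dimension difference means for a bundled index of this corpus's 673 chunks (raw float math; JSON text adds overhead):

// 768-dim:  673 chunks × 768  × 4 bytes ≈ 2.1 MB
// 1536-dim: 673 chunks × 1536 × 4 bytes ≈ 4.1 MB
// Both fit comfortably in a Lambda bundle; the smaller vectors also halve
// the per-query dot-product work during search.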
Decision 3: Chunk Size Matters (a lot)
After testing:
const CHUNK_SIZE = 500; // words
const CHUNK_OVERLAP = 50; // carry context across boundaries
- Too small (~100 words) → loses narrative; more chunks to juggle
- Too big (~2000 words) → fuzzy retrieval; harder for the LLM to stay precise
- ~500 words → strong semantic coherence + good hit rate
The Build
Step 1: Scrape the Letters
// Download 47 PDFs from berkshirehathaway.com
const LETTERS = [
  { year: 2023, url: 'https://www.berkshirehathaway.com/letters/2023ltr.pdf' },
  { year: 2022, url: 'https://www.berkshirehathaway.com/letters/2022ltr.pdf' },
  // ... 45 more years
];

for (const { year, url } of LETTERS) {
  await downloadFile(url, `${year}.pdf`);
}
Output: 47 PDFs (~500 pages total).
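The downloadFile helper is elided above; a minimal version using Node's built-in fetch might look like this (my sketch, not necessarily how the repo does it):

import { writeFile } from 'node:fs/promises';

// Fetch a PDF and write it to disk. Node 18+ ships fetch globally.
async function downloadFile(url: string, destination: string): Promise<void> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Download failed for ${url}: HTTP ${response.status}`);
  }
  const buffer = Buffer.from(await response.arrayBuffer());
  await writeFile(destination, buffer);
}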
Step 2: Text Extraction
import pdfParse from 'pdf-parse';
import fs from 'node:fs';
const dataBuffer = fs.readFileSync(pdfPath);
const { text } = await pdfParse(dataBuffer);
const cleanText = text
  .replace(/^\d+\s*$/gm, '') // strip standalone page numbers (before newlines are collapsed)
  .replace(/[“”]/g, '"')     // normalize curly quotes
  .replace(/\s+/g, ' ');     // collapse whitespace
Result: ~530,000 words of clean text.
Step 3: Chunking with Overlap
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(' '));
  }
  return chunks;
}
The 50-word overlap preserves continuity:
- Chunk 1: “…We believe in long-term value investing…”
- Chunk 2: “…long-term value investing requires patience…”
Total: 673 chunks.
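The snippet above chunks a single text. For the year citations to work later, each chunk also needs to remember which letter it came from. A sketch of how that metadata could be carried (structure assumed, reusing chunkText and the constants from Decision 3):

interface LetterText { year: number; text: string }
interface ChunkRecord { year: number; text: string }

// Chunk each letter separately so no chunk straddles two years,
// and carry the year forward as metadata for later citations.
function buildChunkRecords(letters: LetterText[]): ChunkRecord[] {
  const records: ChunkRecord[] = [];
  for (const { year, text } of letters) {
    for (const chunk of chunkText(text, CHUNK_SIZE, CHUNK_OVERLAP)) {
      records.push({ year, text: chunk });
    }
  }
  return records;
}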
Step 4: Generate Embeddings
// Gemini text-embedding-004 → 768-d vectors
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(apiKey);
const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });
const embeddings: number[][] = [];
for (const c of chunks) {
  const res = await model.embedContent(c);
  embeddings.push(res.embedding.values); // number[768]
}
Output: embeddings.json (~5 MB).
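The pipeline's last step is persisting everything the Lambda needs into one file. A sketch, assuming the { chunks: [...] } shape the handler reads below and the records/embeddings arrays from the previous steps:

import { writeFile } from 'node:fs/promises';

// Pair each chunk record with its vector and write the bundle the Lambda ships with.
const payload = {
  chunks: records.map((record, i) => ({
    year: record.year,
    text: record.text,
    embedding: embeddings[i], // 768 numbers per chunk
  })),
};
await writeFile('embeddings.json', JSON.stringify(payload));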
Step 5: Lambda Runtime
export const handler = async (event) => {
  const { query } = JSON.parse(event.body);
  // 1) Embed the query
  const q = await embedModel.embedContent(query);
  const queryVector = q.embedding.values;
  // 2) Score every chunk by cosine similarity
  const scored = embeddingsData.chunks.map((chunk) => ({
    ...chunk,
    sim: cosineSimilarity(queryVector, chunk.embedding),
  }));
  // 3) Top-K
  const top = scored.sort((a, b) => b.sim - a.sim).slice(0, 5);
  // 4) Compose prompt context
  const context = top.map((t) => t.text).join('\n\n');
  // 5) Generate grounded answer
  const prompt = `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer with citations (years):`;
  const result = await geminiLLM.generateContent(prompt);
  return { answer: result.response.text(), sources: top };
};

Cosine similarity:
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

Challenges & Solutions
- PDF quality (older scans): Aggressive cleaning and validation passes (see the sketch after this list).
- Boundary context loss: 50-word overlap fixed mid-paragraph splits.
- Top-K selection: 3 missed context; 10 added noise. 5 was the sweet spot in practice.
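For the validation pass, simple heuristics went a long way. A sketch of the kind of check I mean (function name and thresholds are illustrative, not the repo's):

// Flag letters whose extraction looks suspicious: too few words,
// or too many non-ASCII artifacts left over from scanned pages.
function looksSuspicious(text: string): boolean {
  const wordCount = text.split(/\s+/).length;
  const weirdChars = (text.match(/[^\x20-\x7E\n]/g) ?? []).length;
  const weirdRatio = weirdChars / Math.max(text.length, 1);
  return wordCount < 3000 || weirdRatio > 0.02; // rough thresholds for these letters
}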
Performance
Cold start
- Lambda init: ~1.5s
- Embed query: ~200ms
- Vector search: ~50ms
- LLM generation: ~1.0s
- Total: ~2.75s
Warm
- Total: ~1.25s
Cost per query
- Gemini (embed + LLM): Free tier
- Lambda: ~$0.0000002
- API Gateway: ~$0.000001
- All-in: ~$0.000001
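Roughly where those numbers come from, assuming standard published rates at the time (Lambda requests at $0.20 per million, API Gateway HTTP APIs at about $1.00 per million):

// API Gateway (HTTP API): ~$1.00 / 1,000,000 requests ≈ $0.000001 per query
// Lambda requests:        ~$0.20 / 1,000,000 requests ≈ $0.0000002 per query
// Lambda duration:        largely absorbed by the always-free 400,000 GB-s/month at this traffic
// Gemini embed + generate: free tier → $0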
What I Learned
- “Simple + fast” beats “complex + fancy.” Bundled vectors are underrated for medium corpora.
- Data > model. I spent more time on parsing and chunking than on embedding models—and it paid off.
- Costs can round to zero. Free-tier Gemini + bundled vectors + serverless is a cheat code.
- Citations change behavior. When the answer includes specific years, users trust—and verify.
Personal note: This project married two things I care about—AI systems in production and real-world finance. Building a tool on top of Buffett’s letters felt like installing a fast index on a library I already love.
Technical Stack
- Data Pipeline: TypeScript + pdf-parse
- Embeddings: Gemini text-embedding-004 (768 dimensions)
- Backend: AWS Lambda (Node.js 22) + API Gateway
- Infra: AWS CDK (TypeScript)
- Frontend: Astro + TypeScript + Tailwind CSS
- Vector Store: Bundled JSON (~5 MB), shipped inside the Lambda package (see the sketch below)
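A minimal sketch of how the bundled JSON can ride along in CDK, assuming the compiled handler and embeddings.json both land in a dist/ directory (paths and construct names are illustrative):

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export class AskWarrenStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Code.fromAsset zips everything in dist/, so embeddings.json ships
    // inside the same package as the handler.
    new lambda.Function(this, 'AskWarrenFn', {
      runtime: lambda.Runtime.NODEJS_22_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('dist'),
      memorySize: 512,
      timeout: cdk.Duration.seconds(15),
    });

    // (The API Gateway endpoint in front of this function is omitted here.)
  }
}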



