AI & Development · 5 min read

Building Ask Warren: A Production RAG System with 47 Years of Investment Wisdom

How I built a retrieval-augmented generation system that lets you chat with Warren Buffett's shareholder letters, and what it taught me about RAG trade-offs in production.

I’ve been curious about RAG (Retrieval-Augmented Generation) for a while. Reading about a technology and actually shipping it are very different. I wanted to feel the real friction—parsing, chunking, embeddings, latency, cost, quality—and see the upside. I like to think in trade-offs.

I had the perfect excuse.

I look up to Warren Buffett. I have a deep interest in finance, and I treat the Berkshire Hathaway shareholder letters like a personal curriculum—47 years of clear thinking on capital allocation, incentives, risk, and integrity. I’ve read them all. Multiple times. They’re dense with ideas you can actually use.

So I built Ask Warren—a production RAG system that answers questions from those letters, with sources and year citations, fast and cheap.

Why Warren Buffett?

I needed a corpus that’s:

  • High-signal (no fluff; consistent voice and quality)
  • Publicly available (PDFs from 1977–2023)
  • Actually interesting to query (things you’d ask a real investor)

And, personally:

  • I admire Buffett’s clarity and discipline.
  • The letters are the primary source. No hot takes. No summaries. Just the original text.
  • I’ve already internalized a lot of it; now I wanted a tool to interrogate and cross-reference it on demand.

Examples I wanted to explore:

  • “What does Warren think about cryptocurrency?”
  • “How should I think about market downturns?”
  • “What makes a great manager?”

The Vision

Users ask questions in plain English. The system finds relevant passages from 47 years of letters. Gemini generates answers with explicit year/source citations. Cost: essentially free (~$0.000001 per query).

Understanding RAG (in practice)

Traditional LLM:

User: "What does Warren think about Bitcoin?"
LLM: ...makes up something plausible from training data...

RAG:

User: "What does Warren think about Bitcoin?"

1) Embed the question
2) Search 673 chunks from the letters
3) Take top 5 most relevant
4) Inject those chunks as context
5) Generate an answer that cites the years/sources

"Based on his 2018 and 2020 letters, Warren said..."

Why this matters: Grounded answers with receipts, not vibes.

Key Architecture Decisions

Decision 1: Bundle Embeddings in Lambda

Options for vector storage:

  • Pinecone (managed): ~$70/mo + network latency
  • S3 + load at runtime: $0 storage, but S3 latency (~100ms) per cold start
  • Bundle with Lambda: $0, lowest latency, simplest

I chose bundled embeddings:

// 673 chunks × 768 dims × 4 bytes ≈ ~2 MB
// Way under Lambda's 250 MB unzipped limit
// No network hop → ~50ms search vs ~150ms with remote calls
// $0 infra for vectors

Trade-off: not for giant corpora. But for this dataset, it’s a perfect fit.
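
For reference, here's a minimal sketch of how the bundled file gets loaded: parsing happens at module scope, so it runs once per cold start and warm invocations reuse the in-memory array. The file name, chunk shape, and ESM-style path resolution are assumptions that match the pipeline described below.

import { readFileSync } from 'node:fs';

interface EmbeddedChunk {
  year: number;        // letter year, used for citations
  text: string;        // ~500-word chunk
  embedding: number[]; // 768-dim vector from text-embedding-004
}

// Parsed once at module load; the same object is reused on warm invocations.
const embeddingsData: { chunks: EmbeddedChunk[] } = JSON.parse(
  readFileSync(new URL('./embeddings.json', import.meta.url), 'utf-8')
);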

Decision 2: Gemini Over OpenAI (for this use case)

Gemini text-embedding-004

  • Free tier (2M tokens/day at time of build)
  • 768-dim vectors (smaller = faster, good enough quality)
  • Works well for semantic retrieval

OpenAI text-embedding-3-small

  • Low cost but not free
  • 1536-dim vectors (larger payloads)
  • Slightly better in some benchmarks

For this project: Free + fast > marginally better.
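
For context, swapping providers would only touch the embedding call. A hedged sketch of the OpenAI equivalent this project decided against (not what Ask Warren ships):

import OpenAI from 'openai';

// Sketch only: the OpenAI path. `chunk` is one ~500-word string from the pipeline.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const res = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunk,
});
const vector = res.data[0].embedding; // 1536 dims vs Gemini's 768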

Decision 3: Chunk Size Matters (a lot)

After testing:

const CHUNK_SIZE = 500;     // words
const CHUNK_OVERLAP = 50;   // carry context across boundaries

  • Too small (~100 words) → loses narrative; more chunks to juggle
  • Too big (~2000 words) → fuzzy retrieval; harder for the LLM to stay precise
  • ~500 words → strong semantic coherence + good hit rate

The Build

Step 1: Scrape the Letters

// Download 47 PDFs from berkshirehathaway.com
const LETTERS = [
  { year: 2023, url: 'https://www.berkshirehathaway.com/letters/2023ltr.pdf' },
  { year: 2022, url: 'https://www.berkshirehathaway.com/letters/2022ltr.pdf' },
  // ... 45 more years
];

for (const { year, url } of LETTERS) {
  await downloadFile(url, `${year}.pdf`);
}

Output: 47 PDFs (~500 pages total).
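
downloadFile above is a small helper, not a library call; a minimal sketch using Node's built-in fetch, with error handling kept deliberately thin:

import { writeFile } from 'node:fs/promises';

// Hypothetical helper: fetch one PDF and write it to disk.
async function downloadFile(url: string, outPath: string): Promise<void> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed (${res.status}): ${url}`);
  await writeFile(outPath, Buffer.from(await res.arrayBuffer()));
}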

Step 2: Text Extraction

import pdfParse from 'pdf-parse';
import fs from 'node:fs';

const dataBuffer = fs.readFileSync(pdfPath);
const { text } = await pdfParse(dataBuffer);

const cleanText = text
  .replace(/^\d+\s*$/gm, '')        // strip standalone page numbers (before newlines disappear)
  .replace(/[\u201C\u201D]/g, '"')  // normalize curly quotes
  .replace(/\s+/g, ' ')             // collapse whitespace last
  .trim();

Result: ~530,000 words of clean text.

Step 3: Chunking with Overlap

function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];

  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(' '));
  }
  return chunks;
}

The 50-word overlap preserves continuity:

  • Chunk 1: “…We believe in long-term value investing…”
  • Chunk 2: “…long-term value investing requires patience…”

Total: 673 chunks.
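
One detail the snippet above glosses over: answers cite years, so each chunk has to remember which letter it came from. A sketch of how that metadata can be carried through (the letters array and field names are assumptions):

interface Chunk {
  year: number;
  text: string;
}

// Chunk each letter separately so every chunk keeps its source year.
const allChunks: Chunk[] = [];
for (const { year, text } of letters) { // letters: { year, text }[] from Step 2
  for (const piece of chunkText(text, CHUNK_SIZE, CHUNK_OVERLAP)) {
    allChunks.push({ year, text: piece });
  }
}

const chunks = allChunks.map((c) => c.text); // plain strings for the embedding step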

Step 4: Generate Embeddings

import { GoogleGenerativeAI } from '@google/generative-ai';

// Gemini text-embedding-004 → 768-d vectors
const genAI = new GoogleGenerativeAI(apiKey);
const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });

const embeddings: number[][] = [];
for (const c of chunks) {
  const res = await model.embedContent(c);
  embeddings.push(res.embedding.values); // number[] of length 768
}

Output: embeddings.json (~5 MB).
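
Persisting the result is just a JSON dump that pairs each chunk (and its year) with its vector; a sketch, matching the shape the Lambda reads:

import { writeFile } from 'node:fs/promises';

// Zip chunks and vectors by index and write the bundle the Lambda ships with.
const output = {
  chunks: allChunks.map((chunk, i) => ({
    year: chunk.year,
    text: chunk.text,
    embedding: embeddings[i],
  })),
};
await writeFile('embeddings.json', JSON.stringify(output));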

Step 5: Lambda Runtime

// embedModel, geminiLLM, and embeddingsData are initialized at module scope (see Decision 1).
export const handler = async (event) => {
  const { query } = JSON.parse(event.body);

  // 1) Embed the query
  const q = (await embedModel.embedContent(query)).embedding.values;

  // 2) Score by cosine similarity
  const scored = embeddingsData.chunks.map((chunk) => ({
    ...chunk,
    sim: cosineSimilarity(q, chunk.embedding),
  }));

  // 3) Top-K
  const top = scored.sort((a, b) => b.sim - a.sim).slice(0, 5);

  // 4) Compose prompt context (tag each chunk with its year so the model can cite it)
  const context = top.map((t) => `[${t.year}] ${t.text}`).join('\n\n');

  // 5) Generate grounded answer
  const prompt = `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer with citations (years):`;
  const result = await geminiLLM.generateContent(prompt);
  const answer = result.response.text();

  return {
    statusCode: 200,
    body: JSON.stringify({
      answer,
      sources: top.map(({ year, text, sim }) => ({ year, text, sim })), // drop the raw vectors
    }),
  };
};

Cosine similarity:

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
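
From the frontend, a query is a single POST to the API Gateway endpoint. A sketch of the client side (the URL is a placeholder, and the response shape assumes the handler above):

// Hypothetical client call; the endpoint URL is a placeholder.
const res = await fetch('https://<api-id>.execute-api.us-east-1.amazonaws.com/ask', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: 'What does Warren think about market downturns?' }),
});
const { answer, sources } = await res.json();
console.log(answer);                                        // grounded answer with year citations
console.log(sources.map((s: { year: number }) => s.year));  // years of the top-5 chunks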

Challenges & Solutions

  • PDF quality (older scans): Aggressive cleaning and validation passes (see the sketch after this list).
  • Boundary context loss: 50-word overlap fixed mid-paragraph splits.
  • Top-K selection: 3 missed context; 10 added noise. 5 was the sweet spot in practice.
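
The "validation passes" were mostly sanity checks on the extracted text; a sketch of the kind of check involved (the thresholds here are assumptions):

// Hypothetical sanity check: flag suspiciously short extractions and
// leftover artifacts from older scanned PDFs.
function validateExtractedText(year: number, text: string): string[] {
  const warnings: string[] = [];
  const wordCount = text.split(/\s+/).length;
  if (wordCount < 3000) warnings.push(`${year}: only ${wordCount} words extracted`);
  const oddChars = (text.match(/[^\x20-\x7E\s]/g) ?? []).length;
  if (oddChars / text.length > 0.01) warnings.push(`${year}: unusually many non-text characters`);
  return warnings;
}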

Performance

Cold start

  • Lambda init: ~1.5s
  • Embed query: ~200ms
  • Vector search: ~50ms
  • LLM generation: ~1.0s
  • Total: ~2.75s

Warm

  • Total: ~1.25s

Cost per query

  • Gemini (embed + LLM): Free tier
  • Lambda: ~$0.0000002
  • API Gateway: ~$0.000001
  • All-in: ~$0.000001

What I Learned

  1. “Simple + fast” beats “complex + fancy.” Bundled vectors are underrated for medium corpora.
  2. Data > model. I spent more time on parsing and chunking than on embedding models—and it paid off.
  3. Costs can round to zero. Free-tier Gemini + bundled vectors + serverless is a cheat code.
  4. Citations change behavior. When the answer includes specific years, users trust—and verify.

Personal note: This project married two things I care about—AI systems in production and real-world finance. Building a tool on top of Buffett’s letters felt like installing a fast index on a library I already love.

Technical Stack

  • Data Pipeline: TypeScript + pdf-parse
  • Embeddings: Gemini text-embedding-004 (768 dimensions)
  • Backend: AWS Lambda (Node.js 22) + API Gateway
  • Infra: AWS CDK (TypeScript)
  • Frontend: Astro + TypeScript + Tailwind CSS
  • Vector Store: Bundled JSON (~5 MB)