AI & Development · 5 min read

Building Ask Warren: A Production RAG System with 47 Years of Investment Wisdom

How I built a retrieval-augmented generation system that lets you chat with Warren Buffett's shareholder letters, and what shipping it taught me about RAG trade-offs in production.

I’ve been curious about RAG (Retrieval-Augmented Generation) for a while. Reading about a technology and actually shipping it are very different. I wanted to feel the real friction—parsing, chunking, embeddings, latency, cost, quality—and see the upside. I like to think in trade-offs.

I had the perfect excuse.

I look up to Warren Buffett. I have a deep interest in finance, and I treat the Berkshire Hathaway shareholder letters like a personal curriculum—47 years of clear thinking on capital allocation, incentives, risk, and integrity. I’ve read them all. Multiple times. They’re dense with ideas you can actually use.

So I built Ask Warren—a production RAG system that answers questions from those letters, with sources and year citations, fast and cheap.

Why Warren Buffett?

I needed a corpus that’s:

  • High-signal (no fluff; consistent voice and quality)
  • Publicly available (PDFs from 1977–2023)
  • Actually interesting to query (things you’d ask a real investor)

And, personally:

  • I admire Buffett’s clarity and discipline.
  • The letters are the primary source. No hot takes. No summaries. Just the original text.
  • I’ve already internalized a lot of it; now I wanted a tool to interrogate and cross-reference it on demand.

Examples I wanted to explore:

  • “What does Warren think about cryptocurrency?”
  • “How should I think about market downturns?”
  • “What makes a great manager?”

The Vision

Users ask questions in plain English. The system finds relevant passages from 47 years of letters. Gemini generates answers with explicit year/source citations. Cost: essentially free (~$0.000001 per query).

Understanding RAG (in practice)

Traditional LLM:

User: "What does Warren think about Bitcoin?"
LLM: ...makes up something plausible from training data...

RAG:

User: "What does Warren think about Bitcoin?"

1) Embed the question
2) Search 673 chunks from the letters
3) Take top 5 most relevant
4) Inject those chunks as context
5) Generate an answer that cites the years/sources

"Based on his 2018 and 2020 letters, Warren said..."

Why this matters: Grounded answers with receipts, not vibes.

Key Architecture Decisions

Decision 1: Bundle Embeddings in Lambda

Options for vector storage:

  • Pinecone (managed): ~$70/mo + network latency
  • S3 + load at runtime: $0 storage, but S3 latency (~100ms) per cold start
  • Bundle with Lambda: $0, lowest latency, simplest

I chose bundled embeddings:

// 673 chunks × 768 dims × 4 bytes ≈ ~2 MB
// Way under Lambda's 250 MB unzipped limit
// No network hop → ~50ms search vs ~150ms with remote calls
// $0 infra for vectors

Trade-off: not for giant corpora. But for this dataset, it’s a perfect fit.
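
"Bundled" here just means the vectors ship inside the deployment package and get parsed at module scope, once per cold start. A minimal sketch, assuming the file name and chunk shape the handler in Step 5 expects:

import fs from 'node:fs';

type EmbeddedChunk = { year: number; text: string; embedding: number[] };

// Parsed once per container, outside the handler, so warm invocations reuse it.
const embeddingsData: { chunks: EmbeddedChunk[] } = JSON.parse(
  fs.readFileSync(new URL('./embeddings.json', import.meta.url), 'utf8')
);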

Decision 2: Gemini Over OpenAI (for this use case)

Gemini text-embedding-004

  • Free tier (2M tokens/day at time of build)
  • 768-dim vectors (smaller = faster, good enough quality)
  • Works well for semantic retrieval

OpenAI text-embedding-3

  • Low cost but not free
  • 1536-dim vectors (larger payloads)
  • Slightly better in some benchmarks

For this project: Free + fast > marginally better.

Decision 3: Chunk Size Matters (a lot)

After testing:

const CHUNK_SIZE = 500;     // words
const CHUNK_OVERLAP = 50;   // carry context across boundaries

  • Too small (~100 words) → loses narrative; more chunks to juggle
  • Too big (~2000 words) → fuzzy retrieval; harder for the LLM to stay precise
  • ~500 words → strong semantic coherence + good hit rate

The Build

Step 1: Scrape the Letters

// Download 47 PDFs from berkshirehathaway.com
const LETTERS = [
  { year: 2023, url: 'https://www.berkshirehathaway.com/letters/2023ltr.pdf' },
  { year: 2022, url: 'https://www.berkshirehathaway.com/letters/2022ltr.pdf' },
  // ... 45 more years
];

for (const { year, url } of LETTERS) {
  await downloadFile(url, `${year}.pdf`);
}
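
downloadFile isn't shown above; a minimal sketch with Node's built-in fetch (the real script may differ):

import fs from 'node:fs/promises';

async function downloadFile(url: string, dest: string): Promise<void> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Failed to download ${url}: ${res.status}`);
  await fs.writeFile(dest, Buffer.from(await res.arrayBuffer()));
}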

Output: 47 PDFs (~500 pages total).

Step 2: Text Extraction

import pdfParse from 'pdf-parse';
import fs from 'node:fs';

const dataBuffer = fs.readFileSync(pdfPath);
const { text } = await pdfParse(dataBuffer);

const cleanText = text
  .replace(/^\d+\s*$/gm, '')  // strip standalone page numbers (before line breaks are collapsed)
  .replace(/[""]/g, '"')      // normalize curly quotes
  .replace(/\s+/g, ' ')       // collapse whitespace
  .trim();

Result: ~530,000 words of clean text.
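
The older letters are scanned PDFs, so extraction quality varies (more on that under Challenges). A simple sanity check worth running at this point; this is a sketch, not the exact validation pass:

// Flag letters whose extracted text is suspiciously short, which usually means
// pdf-parse hit a scanned image instead of real text. The 2,000-word threshold
// is an arbitrary example.
const wordCount = cleanText.split(/\s+/).length;
if (wordCount < 2000) {
  console.warn(`${pdfPath}: only ${wordCount} words extracted, check manually`);
}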

Step 3: Chunking with Overlap

function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];

  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(' '));
  }
  return chunks;
}

The 50-word overlap preserves continuity:

  • Chunk 1: “…We believe in long-term value investing…”
  • Chunk 2: “…long-term value investing requires patience…”

Total: 673 chunks.
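
One detail chunkText doesn't show: for year citations to work, each chunk has to remember which letter it came from. A minimal sketch of that tagging step; the letters array and taggedChunks name are illustrative, not the exact code:

type TaggedChunk = { year: number; text: string };

// letters: Array<{ year, text }> produced by Steps 1–2
const taggedChunks: TaggedChunk[] = [];
for (const { year, text } of letters) {
  for (const piece of chunkText(text, CHUNK_SIZE, CHUNK_OVERLAP)) {
    taggedChunks.push({ year, text: piece });
  }
}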

Step 4: Generate Embeddings

import { GoogleGenerativeAI } from '@google/generative-ai';

// Gemini text-embedding-004 → 768-d vectors
const genAI = new GoogleGenerativeAI(apiKey);
const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });

const embeddings: number[][] = [];
for (const c of chunks) {
  const res = await model.embedContent(c);
  embeddings.push(res.embedding.values); // number[768]
}
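
Then everything gets written to disk next to its chunk. The file name and shape below are assumptions that match what the Lambda loads at startup; taggedChunks is the year-annotated array sketched in Step 3, assuming the loop above embedded the same chunks in the same order:

import fs from 'node:fs/promises';

await fs.writeFile(
  'embeddings.json',
  JSON.stringify({
    chunks: taggedChunks.map((chunk, i) => ({ ...chunk, embedding: embeddings[i] })),
  })
);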

Output: embeddings.json (~5 MB).

Step 5: Lambda Runtime

export const handler = async (event) => {
  const { query } = JSON.parse(event.body);

  // 1) Embed the query
  const q = await embedModel.embedContent(query);
  const queryVector = q.embedding.values;

  // 2) Score every chunk by cosine similarity
  const scored = embeddingsData.chunks.map((chunk) => ({
    ...chunk,
    sim: cosineSimilarity(queryVector, chunk.embedding),
  }));

  // 3) Top-K
  const top = scored.sort((a, b) => b.sim - a.sim).slice(0, 5);

  // 4) Compose prompt context
  const context = top.map((t) => t.text).join('\n\n');

  // 5) Generate grounded answer
  const prompt = `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer with citations (years):`;
  const result = await geminiLLM.generateContent(prompt);
  const answer = result.response.text();

  return {
    statusCode: 200,
    body: JSON.stringify({ answer, sources: top }),
  };
};
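
For reference, calling the deployed endpoint looks roughly like this; the URL is a placeholder, not the real one:

const res = await fetch('https://<api-id>.execute-api.us-east-1.amazonaws.com/ask', {
  method: 'POST',
  body: JSON.stringify({ query: 'What does Warren think about Bitcoin?' }),
});
const { answer, sources } = await res.json();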

Cosine similarity:

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
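
A possible refinement (not what the code above does): normalize every chunk vector once at index time, and cosine similarity at query time collapses to a plain dot product, trimming a few more milliseconds off the search.

// Sketch: normalize vectors when building embeddings.json, then score with a dot product.
function normalize(v: number[]): number[] {
  const mag = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map((x) => x / mag);
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}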

Challenges & Solutions

  • PDF quality (older scans): Aggressive cleaning and validation passes.
  • Boundary context loss: 50-word overlap fixed mid-paragraph splits.
  • Top-K selection: 3 missed context; 10 added noise. 5 was the sweet spot in practice.

Performance

Cold start

  • Lambda init: ~1.5s
  • Embed query: ~200ms
  • Vector search: ~50ms
  • LLM generation: ~1.0s
  • Total: ~2.75s

Warm

  • Total: ~1.25s

Cost per query

  • Gemini (embed + LLM): Free tier
  • Lambda: ~$0.0000002
  • API Gateway: ~$0.000001
  • All-in: ~$0.000001

What I Learned

  1. “Simple + fast” beats “complex + fancy.” Bundled vectors are underrated for medium corpora.
  2. Data > model. I spent more time on parsing and chunking than on embedding models—and it paid off.
  3. Costs can round to zero. Free-tier Gemini + bundled vectors + serverless is a cheat code.
  4. Citations change behavior. When the answer includes specific years, users trust—and verify.

Personal note: This project married two things I care about—AI systems in production and real-world finance. Building a tool on top of Buffett’s letters felt like installing a fast index on a library I already love.

Technical Stack

  • Data Pipeline: TypeScript + pdf-parse
  • Embeddings: Gemini text-embedding-004 (768 dimensions)
  • Backend: AWS Lambda (Node.js 22) + API Gateway
  • Infra: AWS CDK (TypeScript)
  • Frontend: Astro + TypeScript + Tailwind CSS
  • Vector Store: Bundled JSON (~5 MB)