Building a Production RAG System for AI-Powered Tutoring

Posted on April 12, 2026 · Tags: AI Engineering, RAG, Vector Databases, Machine Learning, PostgreSQL

When building AICademy, my AI-powered tutoring platform, I faced a critical challenge: how do you give an AI tutor the right context without overwhelming it with information? The answer was implementing a production-ready RAG (Retrieval-Augmented Generation) system that retrieves only the most relevant content for each student question.

The Problem

Initially, my AI tutor received entire lessons with every prompt. This created three major issues:

  • Bloated prompts – a single question could pull in an entire lesson of up to ~5,000 tokens
  • Generic answers – the relevant material was buried in too much surrounding context
  • Wasted tokens – usage scaled with lesson length rather than with the question being asked

With RAG, the system now retrieves only the 3-5 most relevant content chunks based on semantic similarity to the student's question. The result: 80% smaller prompts and significantly more targeted, accurate responses.

[Figure: RAG system architecture diagram]

Technical Architecture

Here's the technology stack I chose for this production system:

  • Embeddings: Ollama running locally, producing 768-dimensional vectors
  • Vector store: PostgreSQL with the pgvector extension, hosted on Supabase
  • Application layer: TypeScript, calling the database through Supabase RPC functions

Why These Choices?

I chose Ollama over OpenAI for several reasons:

  • Cost – embeddings are generated locally, so the marginal cost is $0
  • Data locality – student questions and course content never leave my own infrastructure
  • No external dependency – no API keys, rate limits, or third-party network round-trips

For the vector database, I went with pgvector over Pinecone/Weaviate because:

  • My lesson and module data already lives in PostgreSQL, so vectors sit next to the rows they describe
  • It's one less specialized service to deploy, monitor, and pay for
  • Supabase ships with pgvector support, so setup was a single extension away

Implementation Process

Step 1: Semantic Chunking

I built a content chunker that intelligently splits lessons into semantic units by preserving heading structure and context:

// Chunks lessons by heading structure to preserve context
const chunks = chunkLesson({
  lessonId: "lesson-uuid",
  content: "<h2>Introduction</h2><p>Content...</p>",
  moduleId: "module-uuid"
});

// Output: Array of semantically meaningful chunks
// Each chunk preserves its section title and context

Results: 31 lessons transformed into 29 semantic chunks, each preserving its section title and context.
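If you're curious, the core of that splitting logic can fit in a few lines. Here's a minimal sketch (my production chunker does more; the helper below is illustrative): it splits lesson HTML on <h2> boundaries so each chunk keeps its section title.

// Minimal heading-based chunker sketch (illustrative, not the production version)
interface LessonChunk {
  lessonId: string;
  moduleId: string;
  sectionTitle: string;
  chunkText: string;
}

function chunkLesson(params: {
  lessonId: string;
  content: string;
  moduleId: string;
}): LessonChunk[] {
  // Split the HTML at each <h2> so every section becomes one chunk
  return params.content
    .split(/(?=<h2>)/)
    .map((section) => {
      const title = section.match(/<h2>(.*?)<\/h2>/)?.[1] ?? "Untitled";
      // Strip tags and collapse whitespace to get plain chunk text
      const text = section.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
      return {
        lessonId: params.lessonId,
        moduleId: params.moduleId,
        sectionTitle: title,
        chunkText: text,
      };
    })
    .filter((chunk) => chunk.chunkText.length > 0);
}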

Step 2: Vectorization with Local Embeddings

Next, I created an Ollama wrapper to generate embeddings for all content:

// Generate 768-dimensional embedding vector
const embedding = await getEmbedding("How do loops work?");
// Returns: [0.123, -0.456, 0.789, ...] (768 numbers)

// Batch process all lesson chunks
const embeddings = await getBatchEmbeddings(chunks, {
  onProgress: (current, total) => console.log(`${current}/${total}`)
});
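The wrapper itself is thin. A minimal sketch, assuming Ollama's default localhost port and a 768-dimensional embedding model such as nomic-embed-text (the exact model name here is an assumption):

// Minimal Ollama embedding wrapper sketch (model name and URL are assumptions)
const OLLAMA_URL = "http://localhost:11434/api/embeddings";
const EMBED_MODEL = "nomic-embed-text"; // any 768-dim embedding model works

export async function getEmbedding(text: string): Promise<number[]> {
  const res = await fetch(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: EMBED_MODEL, prompt: text }),
  });
  if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);
  // Ollama's /api/embeddings responds with { embedding: number[] }
  const { embedding } = await res.json();
  return embedding;
}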

Validation: semantically related queries like "How do parameters work?" and "What are function parameters?" scored 76.4% cosine similarity.
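That figure is just the cosine similarity between the two query embeddings, which you can compute directly:

// Cosine similarity between two embedding vectors (used for validation)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. ~0.764 for the two parameter questions above
const sim = cosineSimilarity(
  await getEmbedding("How do parameters work?"),
  await getEmbedding("What are function parameters?")
);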

Step 3: Vector Database Setup

I configured PostgreSQL with pgvector for similarity search:

-- Create table with vector column (768 dimensions)
CREATE TABLE lesson_chunks (
  id UUID PRIMARY KEY,
  chunk_text TEXT NOT NULL,
  embedding vector(768),
  section_title TEXT,
  lesson_id UUID,
  module_id UUID
);

-- Create IVFFlat index for fast similarity search
CREATE INDEX idx_lesson_chunks_embedding 
ON lesson_chunks 
USING ivfflat (embedding vector_cosine_ops);
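With the table in place, loading the embedded chunks is a straightforward insert through the Supabase client. Here's a sketch reusing the LessonChunk shape from the chunking example (the loader below is illustrative, not my exact code); note that pgvector accepts vectors in '[x,y,...]' text form:

// Hypothetical loader: writes embedded chunks into lesson_chunks
import { createClient } from "@supabase/supabase-js";
import { randomUUID } from "crypto";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

async function storeChunks(chunks: LessonChunk[], embeddings: number[][]) {
  const rows = chunks.map((chunk, i) => ({
    id: randomUUID(),
    chunk_text: chunk.chunkText,
    section_title: chunk.sectionTitle,
    lesson_id: chunk.lessonId,
    module_id: chunk.moduleId,
    // pgvector accepts the '[0.1,0.2,...]' text representation
    embedding: `[${embeddings[i].join(",")}]`,
  }));
  const { error } = await supabase.from("lesson_chunks").insert(rows);
  if (error) throw error;
}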

Step 4: Semantic Search Implementation

I built the vector search function using cosine similarity:

export async function searchLessonChunks(params: {
  query: string;
  moduleId?: string;
  matchThreshold?: number;
  matchCount?: number;
}) {
  const { query, moduleId, matchThreshold = 0.5, matchCount = 5 } = params;

  // Generate query embedding
  const queryEmbedding = await getEmbedding(query);
  const vectorString = `[${queryEmbedding.join(',')}]`;

  // Perform vector similarity search via PostgreSQL RPC
  const { data, error } = await supabase.rpc('search_lesson_chunks', {
    query_embedding: vectorString,
    match_module_id: moduleId || null,
    match_threshold: matchThreshold,
    match_count: matchCount,
  });

  if (error) throw error;

  return data.map((row: any) => ({
    sectionTitle: row.section_title,
    chunkText: row.chunk_text,
    similarity: row.similarity,
  }));
}

The PostgreSQL function uses the cosine distance operator (<=>) for efficient similarity matching:

CREATE FUNCTION search_lesson_chunks(
  query_embedding TEXT,
  match_module_id UUID DEFAULT NULL,
  match_threshold FLOAT DEFAULT 0.5,
  match_count INT DEFAULT 5
)
RETURNS TABLE (
  section_title TEXT,
  chunk_text TEXT,
  similarity FLOAT
)
AS $$
BEGIN
  RETURN QUERY
  SELECT
    lc.section_title,
    lc.chunk_text,
    1 - (lc.embedding <=> query_embedding::vector(768)) AS similarity
  FROM lesson_chunks lc
  WHERE
    (match_module_id IS NULL OR lc.module_id = match_module_id)
    AND (1 - (lc.embedding <=> query_embedding::vector(768))) >= match_threshold
  ORDER BY lc.embedding <=> query_embedding::vector(768)
  LIMIT match_count;
END;
$$ LANGUAGE plpgsql;

Key Engineering Challenge

The biggest challenge was type compatibility across stack layers. The Supabase JavaScript client couldn't pass embedding arrays as vector(768) parameters to PostgreSQL RPC functions.

Symptoms: RPC calls failed whenever the raw embedding array was passed as a vector(768) parameter – the client serializes arrays as JSON, which PostgreSQL will not implicitly cast to the vector type.

Solution:

  1. Changed RPC function parameter from vector(768) to TEXT
  2. Cast TEXT to vector internally: query_embedding::vector(768)
  3. Updated TypeScript to send embeddings as strings: [0.1,0.2,0.3,...]

This workaround balances type safety with cross-platform compatibility.

Results & Validation

Here are real test queries and their top results:

Query: "How do loops work in programming?"

  • Result 1: "Common Loop Patterns" - 68.2% similarity ✅
  • Result 2: "While Loops" - 55.7% similarity ✅
  • Result 3: "Pattern Recognition" - 48.9% similarity

Query: "What is scientific thinking?"

  • Result 1: "Core Steps of Scientific Thinking" - 66.5% similarity ✅
  • Result 2: "Making Observations" - 61.3% similarity ✅
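Spot checks like these are easy to script against the search function from Step 4 (threshold lowered here so near-misses show up too):

// Smoke test: print top matches and similarity scores for sample queries
const testQueries = [
  "How do loops work in programming?",
  "What is scientific thinking?",
];

for (const query of testQueries) {
  const results = await searchLessonChunks({ query, matchThreshold: 0.4, matchCount: 3 });
  console.log(query);
  for (const r of results) {
    console.log(`  ${r.sectionTitle}: ${(r.similarity * 100).toFixed(1)}%`);
  }
}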

Performance Metrics

  • Prompt size: reduced by ~80% (3-5 retrieved chunks instead of a full lesson)
  • Embedding dimensionality: 768, generated locally via Ollama
  • Embedding cost: $0 – no external API calls
  • Retrieval quality: 76.4% similarity between semantically related query pairs

Impact on AI Tutoring

The difference is dramatic:

Before RAG:

Student: "How do loops work?"
System: [Sends entire 5000-token lesson on programming basics]
AI: [Generic response based on too much context]

After RAG:

Student: "How do loops work?"
System: [Retrieves 3 chunks: "Common Loop Patterns", "While Loops", "For Loop Examples"]
AI: [Focused, relevant response about loops specifically]
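Wiring retrieval into the tutor is a thin layer on top of searchLessonChunks. Here's a sketch of the prompt assembly (the prompt wording is illustrative):

// Sketch: build a focused tutor prompt from the top retrieved chunks
async function buildTutorPrompt(question: string, moduleId: string): Promise<string> {
  const chunks = await searchLessonChunks({ query: question, moduleId, matchCount: 3 });
  const context = chunks
    .map((c) => `## ${c.sectionTitle}\n${c.chunkText}`)
    .join("\n\n");
  return [
    "Answer the student using only the course material below.",
    "",
    context,
    "",
    `Student question: ${question}`,
  ].join("\n");
}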

Key Takeaways

  1. RAG isn't just about retrieval – chunking strategy and similarity thresholds are critical
  2. Local embeddings are viable – Ollama provides production-quality embeddings at $0 cost
  3. PostgreSQL can be a vector DB – pgvector eliminates need for specialized services
  4. Type systems matter – cross-platform type compatibility requires careful engineering
  5. Validation is essential – semantic similarity tests ensure quality before production

Future Enhancements

Bottom Line:

I built a production RAG system from scratch that cuts AI tutoring prompts by roughly 80% and makes responses significantly more relevant, all while keeping embedding costs at $0 through local generation.