Building a Production RAG System for AI-Powered Tutoring
Posted on April 12, 2026 · Tags: AI Engineering, RAG, Vector Databases, Machine Learning, PostgreSQL
When building AICademy, my AI-powered tutoring platform, I faced a critical challenge: how do you give an AI tutor the right context without overwhelming it with information? The answer was implementing a production-ready RAG (Retrieval-Augmented Generation) system that retrieves only the most relevant content for each student question.
The Problem
Initially, my AI tutor received entire lessons with every prompt. This created three major issues:
- Oversized prompts – Wasted tokens and increased costs
- Generic responses – Too much irrelevant context confused the AI
- Poor scalability – Couldn't handle large content libraries
With RAG, the system now retrieves only the 3-5 most relevant content chunks based on semantic similarity to the student's question. This resulted in 80% smaller prompts and significantly more targeted, accurate responses.

Technical Architecture
Here's the technology stack I chose for this production system:
- Embeddings: Ollama with nomic-embed-text (768-dimensional vectors)
- Vector Database: PostgreSQL with pgvector extension
- Platform: Supabase
- Runtime: Next.js 15 + TypeScript
- Cost: $0 (fully local embeddings vs OpenAI API fees)
Why These Choices?
I chose Ollama over OpenAI for several reasons:
- Free, local embedding generation
- Privacy-first (no data sent to third parties)
- Fast performance (<100ms per embedding)
- 768-dim vectors optimized for retrieval tasks
For the vector database, I went with pgvector over Pinecone/Weaviate because:
- Leverages existing PostgreSQL infrastructure
- Native SQL integration
- No additional service to manage
- Battle-tested indexing algorithms (IVFFlat)
Implementation Process
Step 1: Semantic Chunking
I built a content chunker that intelligently splits lessons into semantic units by preserving heading structure and context:
// Chunks lessons by heading structure to preserve context
const chunks = chunkLesson({
  lessonId: "lesson-uuid",
  content: "<h2>Introduction</h2><p>Content...</p>",
  moduleId: "module-uuid",
});
// Output: Array of semantically meaningful chunks

Results: 31 lessons transformed into 29 semantic chunks, each preserving its section title and context.
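For the curious, here's a minimal sketch of the splitting idea (this simplified version splits on <h2> tags only; the real implementation is more thorough):

// Simplified sketch of heading-based chunking: split on <h2> only
interface LessonChunk {
  lessonId: string;
  moduleId: string;
  sectionTitle: string;
  chunkText: string;
}

function chunkLesson(params: {
  lessonId: string;
  content: string;
  moduleId: string;
}): LessonChunk[] {
  const { lessonId, content, moduleId } = params;
  // Split the HTML on <h2> boundaries, keeping each heading with its body
  const sections = content.split(/(?=<h2>)/);
  return sections
    .map((section) => {
      const titleMatch = section.match(/<h2>(.*?)<\/h2>/);
      return {
        lessonId,
        moduleId,
        sectionTitle: titleMatch ? titleMatch[1] : "Untitled section",
        // Strip tags so the chunk text embeds cleanly
        chunkText: section.replace(/<[^>]+>/g, " ").trim(),
      };
    })
    .filter((chunk) => chunk.chunkText.length > 0);
}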
Step 2: Vectorization with Local Embeddings
Next, I created an Ollama wrapper to generate embeddings for all content:
// Generate 768-dimensional embedding vector
const embedding = await getEmbedding("How do loops work?");
// Returns: [0.123, -0.456, 0.789, ...] (768 numbers)

// Batch process all lesson chunks
const embeddings = await getBatchEmbeddings(chunks, {
  onProgress: (current, total) => console.log(`${current}/${total}`),
});

Validation: I achieved 76.4% similarity between semantically related queries like "How do parameters work?" and "What are function parameters?"
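The wrapper itself is a thin HTTP client around Ollama. A minimal sketch, assuming a local instance on the default port (11434) and Ollama's /api/embeddings endpoint:

// Minimal sketch: call a local Ollama instance for an embedding.
// Assumes the default port and the nomic-embed-text model.
const OLLAMA_URL = "http://localhost:11434/api/embeddings";

export async function getEmbedding(text: string): Promise<number[]> {
  const res = await fetch(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  if (!res.ok) throw new Error(`Ollama embedding failed: ${res.status}`);
  const { embedding } = await res.json();
  return embedding; // 768 numbers for nomic-embed-text
}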
Step 3: Vector Database Setup
I configured PostgreSQL with pgvector for similarity search:
-- Create table with vector column (768 dimensions)
CREATE TABLE lesson_chunks (
  id UUID PRIMARY KEY,
  chunk_text TEXT NOT NULL,
  embedding vector(768),
  section_title TEXT,
  lesson_id UUID,
  module_id UUID
);

-- Create IVFFlat index for fast similarity search
CREATE INDEX idx_lesson_chunks_embedding
  ON lesson_chunks
  USING ivfflat (embedding vector_cosine_ops);

Step 4: Semantic Search Implementation
I built the vector search function using cosine similarity:
export async function searchLessonChunks(params: {
  query: string;
  moduleId?: string;
  matchThreshold?: number;
  matchCount?: number;
}) {
  const { query, moduleId, matchThreshold = 0.5, matchCount = 5 } = params;

  // Generate query embedding
  const queryEmbedding = await getEmbedding(query);
  const vectorString = `[${queryEmbedding.join(',')}]`;

  // Perform vector similarity search via PostgreSQL RPC
  const { data, error } = await supabase.rpc('search_lesson_chunks', {
    query_embedding: vectorString,
    match_module_id: moduleId || null,
    match_threshold: matchThreshold,
    match_count: matchCount,
  });

  if (error) throw error;

  return data.map((row: any) => ({
    sectionTitle: row.section_title,
    chunkText: row.chunk_text,
    similarity: row.similarity,
  }));
}

The PostgreSQL function uses the cosine distance operator (<=>) for efficient similarity matching:
CREATE FUNCTION search_lesson_chunks(
  query_embedding TEXT,
  match_module_id UUID DEFAULT NULL,
  match_threshold FLOAT DEFAULT 0.5,
  match_count INT DEFAULT 5
)
RETURNS TABLE (
  section_title TEXT,
  chunk_text TEXT,
  similarity FLOAT
)
AS $$
BEGIN
  RETURN QUERY
  SELECT
    lc.section_title,
    lc.chunk_text,
    1 - (lc.embedding <=> query_embedding::vector(768)) AS similarity
  FROM lesson_chunks lc
  WHERE
    (match_module_id IS NULL OR lc.module_id = match_module_id)
    AND (1 - (lc.embedding <=> query_embedding::vector(768))) >= match_threshold
  ORDER BY lc.embedding <=> query_embedding::vector(768)
  LIMIT match_count;
END;
$$ LANGUAGE plpgsql;

Key Engineering Challenge
The biggest challenge was type compatibility across stack layers. The Supabase JavaScript client couldn't pass embedding arrays as vector(768) parameters to PostgreSQL RPC functions.
Symptoms:
- SQL queries worked perfectly
- JavaScript RPC calls returned 0 results
- No error messages (silent failure)
Solution:
- Changed the RPC function parameter from vector(768) to TEXT
- Cast TEXT to vector internally: query_embedding::vector(768)
- Updated the TypeScript client to send embeddings as strings: [0.1,0.2,0.3,...]
This workaround trades some parameter-level type safety for reliable compatibility across the stack.
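To make the failure mode concrete, here's the shape of both calls, simplified:

// Before (silent failure in my setup): raw number[] sent to an RPC
// parameter declared as vector(768)
const broken = await supabase.rpc('search_lesson_chunks', {
  query_embedding: queryEmbedding, // number[] -> 0 rows, no error
  match_count: 5,
});

// After: serialize to pgvector's text format; the function casts it
// internally with ::vector(768)
const vectorString = `[${queryEmbedding.join(',')}]`;
const working = await supabase.rpc('search_lesson_chunks', {
  query_embedding: vectorString, // e.g. "[0.1,0.2,0.3,...]"
  match_count: 5,
});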
Results & Validation
Here are real test queries and their top results:
Query: "How do loops work in programming?"
- Result 1: "Common Loop Patterns" - 68.2% similarity ✅
- Result 2: "While Loops" - 55.7% similarity ✅
- Result 3: "Pattern Recognition" - 48.9% similarity
Query: "What is scientific thinking?"
- Result 1: "Core Steps of Scientific Thinking" - 66.5% similarity ✅
- Result 2: "Making Observations" - 61.3% similarity ✅
Performance Metrics
- Embedding Generation: <100ms per query
- Vector Search: <50ms per search
- Similarity Scores: 0.3-0.7 range for relevant content
- Chunks Retrieved: 5 per query (configurable)
- Prompt Size Reduction: ~80% (5 chunks vs full lessons)
Impact on AI Tutoring
The difference is dramatic:
Before RAG:
Student: "How do loops work?"
System: [Sends entire 5000-token lesson on programming basics]
AI: [Generic response based on too much context]
After RAG:
Student: "How do loops work?"
System: [Retrieves 3 chunks: "Common Loop Patterns", "While Loops", "For Loop Examples"]
AI: [Focused, relevant response about loops specifically]
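Wiring that together, the retrieval step looks roughly like this (the prompt template is illustrative, and currentModuleId is assumed to be in scope):

// Retrieve only the most relevant chunks for the student's question
const chunks = await searchLessonChunks({
  query: "How do loops work?",
  moduleId: currentModuleId, // assumed to be in scope
  matchCount: 3,
});

// Only the relevant sections reach the model, not the whole lesson
const context = chunks
  .map((c) => `## ${c.sectionTitle}\n${c.chunkText}`)
  .join('\n\n');

const prompt = `You are a patient programming tutor.
Answer using ONLY the context below.

Context:
${context}

Student question: How do loops work?`;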
Key Takeaways
- RAG isn't just about retrieval – chunking strategy and similarity thresholds are critical
- Local embeddings are viable – Ollama provides production-quality embeddings at $0 cost
- PostgreSQL can be a vector DB – pgvector eliminates need for specialized services
- Type systems matter – cross-platform type compatibility requires careful engineering
- Validation is essential – semantic similarity tests ensure quality before production
Future Enhancements
- Hybrid search (semantic + keyword)
- Re-ranking with cross-encoders
- Dynamic chunk sizing based on content type
- Caching layer for frequent queries (see the sketch after this list)
- A/B testing RAG vs non-RAG responses
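As a preview of the caching idea, a minimal in-memory sketch; a production version would want an LRU bound or a shared store like Redis:

// Memoize query embeddings so repeated student questions skip
// the Ollama round trip entirely
const embeddingCache = new Map<string, number[]>();

export async function getCachedEmbedding(text: string): Promise<number[]> {
  const key = text.trim().toLowerCase();
  const cached = embeddingCache.get(key);
  if (cached) return cached;

  const embedding = await getEmbedding(text);
  embeddingCache.set(key, embedding);
  return embedding;
}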
Bottom Line:
Built a production RAG system from scratch that cuts prompt sizes by roughly 80% and makes AI tutoring responses significantly more relevant, all while keeping embedding costs at $0 through local generation.