Building a Production RAG System for AI-Powered Tutoring
Posted on April 12, 2026 · Tags: AI Engineering, RAG, Vector Databases, Machine Learning, PostgreSQL
When building AICademy, my AI-powered tutoring platform, I faced a critical challenge: how do you give an AI tutor the right context without overwhelming it with information? The answer was implementing a production-ready RAG (Retrieval-Augmented Generation) system that retrieves only the most relevant content for each student question.
The Problem
Initially, my AI tutor received entire lessons with every prompt. This created three major issues:
- Oversized prompts – Wasted tokens and increased costs
- Generic responses – Too much irrelevant context confused the AI
- Poor scalability – Couldn't handle large content libraries
With RAG, the system now retrieves only the 3-5 most relevant content chunks based on semantic similarity to the student's question. This resulted in 80% smaller prompts and significantly more targeted, accurate responses.

Technical Architecture
Here's the technology stack I chose for this production system:
- Embeddings: Ollama with nomic-embed-text (768-dimensional vectors)
- Vector Database: PostgreSQL with pgvector extension
- Platform: Supabase
- Runtime: Next.js 15 + TypeScript
- Cost: $0 (fully local embeddings vs OpenAI API fees)
Why These Choices?
I chose Ollama over OpenAI for several reasons:
- Free, local embedding generation
- Privacy-first (no data sent to third parties)
- Fast performance (<100ms per embedding)
- 768-dim vectors optimized for retrieval tasks
For the vector database, I went with pgvector over Pinecone/Weaviate because:
- Leverages existing PostgreSQL infrastructure
- Native SQL integration
- No additional service to manage
- Battle-tested indexing algorithms (IVFFlat)
Implementation Process
Step 1: Semantic Chunking
I built a content chunker that intelligently splits lessons into semantic units by preserving heading structure and context:
// Chunks lessons by heading structure to preserve context
const chunks = chunkLesson({
  lessonId: "lesson-uuid",
  content: "<h2>Introduction</h2><p>Content...</p>",
  moduleId: "module-uuid",
});
// Output: Array of semantically meaningful chunks

Results: 31 lessons transformed into 29 semantic chunks, each preserving its section title and context.
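For the curious, here's a minimal sketch of the splitting idea (this simplified version splits on <h2> tags only; the real implementation is more thorough):

// Simplified sketch of heading-based chunking: split on <h2> only
interface LessonChunk {
  lessonId: string;
  moduleId: string;
  sectionTitle: string;
  chunkText: string;
}

function chunkLesson(params: {
  lessonId: string;
  content: string;
  moduleId: string;
}): LessonChunk[] {
  const { lessonId, content, moduleId } = params;
  // Split the HTML on <h2> boundaries, keeping each heading with its body
  const sections = content.split(/(?=<h2>)/);
  return sections
    .map((section) => {
      const titleMatch = section.match(/<h2>(.*?)<\/h2>/);
      return {
        lessonId,
        moduleId,
        sectionTitle: titleMatch ? titleMatch[1] : "Untitled section",
        // Strip tags so the chunk text embeds cleanly
        chunkText: section.replace(/<[^>]+>/g, " ").trim(),
      };
    })
    .filter((chunk) => chunk.chunkText.length > 0);
}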
Step 2: Vectorization with Local Embeddings
Next, I created an Ollama wrapper to generate embeddings for all content:
// Generate 768-dimensional embedding vector
const embedding = await getEmbedding("How do loops work?");
// Returns: [0.123, -0.456, 0.789, ...] (768 numbers)

// Batch process all lesson chunks
const embeddings = await getBatchEmbeddings(chunks, {
  onProgress: (current, total) => console.log(`${current}/${total}`),
});

Validation: I achieved 76.4% similarity between semantically related queries like "How do parameters work?" and "What are function parameters?"
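The wrapper itself is a thin HTTP client around Ollama. A minimal sketch, assuming a local instance on the default port (11434) and Ollama's /api/embeddings endpoint:

// Minimal sketch: call a local Ollama instance for an embedding.
// Assumes the default port and the nomic-embed-text model.
const OLLAMA_URL = "http://localhost:11434/api/embeddings";

export async function getEmbedding(text: string): Promise<number[]> {
  const res = await fetch(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  if (!res.ok) throw new Error(`Ollama embedding failed: ${res.status}`);
  const { embedding } = await res.json();
  return embedding; // 768 numbers for nomic-embed-text
}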
Step 3: Vector Database Setup
I configured PostgreSQL with pgvector for similarity search:
-- Create table with vector column (768 dimensions)
CREATE TABLE lesson_chunks (
  id UUID PRIMARY KEY,
  chunk_text TEXT NOT NULL,
  embedding vector(768),
  section_title TEXT,
  lesson_id UUID,
  module_id UUID
);

-- Create IVFFlat index for fast similarity search
CREATE INDEX idx_lesson_chunks_embedding
  ON lesson_chunks
  USING ivfflat (embedding vector_cosine_ops);

Step 4: Semantic Search Implementation
I built the vector search function using cosine similarity:
export async function searchLessonChunks(params: {
  query: string;
  moduleId?: string;
  matchThreshold?: number;
  matchCount?: number;
}) {
  const { query, moduleId, matchThreshold = 0.5, matchCount = 5 } = params;

  // Generate query embedding
  const queryEmbedding = await getEmbedding(query);
  const vectorString = `[${queryEmbedding.join(',')}]`;

  // Perform vector similarity search via PostgreSQL RPC
  const { data, error } = await supabase.rpc('search_lesson_chunks', {
    query_embedding: vectorString,
    match_module_id: moduleId || null,
    match_threshold: matchThreshold,
    match_count: matchCount,
  });

  if (error) throw error;

  return data.map((row: any) => ({
    sectionTitle: row.section_title,
    chunkText: row.chunk_text,
    similarity: row.similarity,
  }));
}

The PostgreSQL function uses the cosine distance operator (<=>) for efficient similarity matching:
CREATE FUNCTION search_lesson_chunks(
  query_embedding TEXT,
  match_module_id UUID DEFAULT NULL,
  match_threshold FLOAT DEFAULT 0.5,
  match_count INT DEFAULT 5
)
RETURNS TABLE (
  section_title TEXT,
  chunk_text TEXT,
  similarity FLOAT
)
AS $$
BEGIN
  RETURN QUERY
  SELECT
    lc.section_title,
    lc.chunk_text,
    1 - (lc.embedding <=> query_embedding::vector(768)) AS similarity
  FROM lesson_chunks lc
  WHERE
    (match_module_id IS NULL OR lc.module_id = match_module_id)
    AND (1 - (lc.embedding <=> query_embedding::vector(768))) >= match_threshold
  ORDER BY lc.embedding <=> query_embedding::vector(768)
  LIMIT match_count;
END;
$$ LANGUAGE plpgsql;

Key Engineering Challenge
The biggest challenge was type compatibility across stack layers. The Supabase JavaScript client couldn't pass embedding arrays as vector(768) parameters to PostgreSQL RPC functions.
Symptoms:
- SQL queries worked perfectly
- JavaScript RPC calls returned 0 results
- No error messages (silent failure)
Solution:
- Changed the RPC function parameter from vector(768) to TEXT
- Cast TEXT to vector internally: query_embedding::vector(768)
- Updated the TypeScript client to send embeddings as strings: [0.1,0.2,0.3,...]
This workaround trades some parameter-level type safety for reliable compatibility across the stack.
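To make the failure mode concrete, here's the shape of both calls, simplified:

// Before (silent failure in my setup): raw number[] sent to an RPC
// parameter declared as vector(768)
const broken = await supabase.rpc('search_lesson_chunks', {
  query_embedding: queryEmbedding, // number[] -> 0 rows, no error
  match_count: 5,
});

// After: serialize to pgvector's text format; the function casts it
// internally with ::vector(768)
const vectorString = `[${queryEmbedding.join(',')}]`;
const working = await supabase.rpc('search_lesson_chunks', {
  query_embedding: vectorString, // e.g. "[0.1,0.2,0.3,...]"
  match_count: 5,
});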
Results & Validation
Here are real test queries and their top results:
Query: "How do loops work in programming?"
- Result 1: "Common Loop Patterns" - 68.2% similarity ✅
- Result 2: "While Loops" - 55.7% similarity ✅
- Result 3: "Pattern Recognition" - 48.9% similarity
Query: "What is scientific thinking?"
- Result 1: "Core Steps of Scientific Thinking" - 66.5% similarity ✅
- Result 2: "Making Observations" - 61.3% similarity ✅
Performance Metrics
- Embedding Generation: <100ms per query
- Vector Search: <50ms per search
- Similarity Scores: 0.3-0.7 range for relevant content
- Chunks Retrieved: 5 per query (configurable)
- Prompt Size Reduction: ~80% (5 chunks vs full lessons)
Impact on AI Tutoring
The difference is dramatic:
Before RAG:
Student: "How do loops work?"
System: [Sends entire 5000-token lesson on programming basics]
AI: [Generic response based on too much context]
After RAG:
Student: "How do loops work?"
System: [Retrieves 3 chunks: "Common Loop Patterns", "While Loops", "For Loop Examples"]
AI: [Focused, relevant response about loops specifically]
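Wiring that together, the retrieval step looks roughly like this (the prompt template is illustrative, and currentModuleId is assumed to be in scope):

// Retrieve only the most relevant chunks for the student's question
const chunks = await searchLessonChunks({
  query: "How do loops work?",
  moduleId: currentModuleId, // assumed to be in scope
  matchCount: 3,
});

// Only the relevant sections reach the model, not the whole lesson
const context = chunks
  .map((c) => `## ${c.sectionTitle}\n${c.chunkText}`)
  .join('\n\n');

const prompt = `You are a patient programming tutor.
Answer using ONLY the context below.

Context:
${context}

Student question: How do loops work?`;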
Key Takeaways
- RAG isn't just about retrieval – chunking strategy and similarity thresholds are critical
- Local embeddings are viable – Ollama provides production-quality embeddings at $0 cost
- PostgreSQL can be a vector DB – pgvector eliminates need for specialized services
- Type systems matter – cross-platform type compatibility requires careful engineering
- Validation is essential – semantic similarity tests ensure quality before production
Future Enhancements
- Hybrid search (semantic + keyword)
- Re-ranking with cross-encoders
- Dynamic chunk sizing based on content type
- Caching layer for frequent queries (see the sketch after this list)
- A/B testing RAG vs non-RAG responses
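As a preview of the caching idea, a minimal in-memory sketch; a production version would want an LRU bound or a shared store like Redis:

// Memoize query embeddings so repeated student questions skip
// the Ollama round trip entirely
const embeddingCache = new Map<string, number[]>();

export async function getCachedEmbedding(text: string): Promise<number[]> {
  const key = text.trim().toLowerCase();
  const cached = embeddingCache.get(key);
  if (cached) return cached;

  const embedding = await getEmbedding(text);
  embeddingCache.set(key, embedding);
  return embedding;
}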
Bottom Line:
Built a production RAG system from scratch that cuts prompt sizes by roughly 80% and makes AI tutoring responses significantly more relevant, all while keeping embedding costs at $0 through local generation.