- Built and deployed a production-grade RAG chatbot for study-abroad counseling, designing a router workflow with retrieval grading, user profile extraction, and resilient fallback handling for LLM calls.
- Built an end-to-end ingestion and retrieval pipeline over PostgreSQL + PGVector, covering chunking, embedding, and efficient semantic search across 3,100+ documents.
- Implemented a hybrid (keyword + semantic) citation validator with clickable source-linked outputs and localized currency normalization for NPR.
- Optimized API reliability via async I/O, connection pooling, and multi-worker Gunicorn. Load-tested with Locust; deployed via systemd on Azure VPS.
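The ingestion pipeline above begins with chunking. A minimal sketch of an overlapping character-window chunker (the window size and overlap values are illustrative, not the production settings; the real pipeline then embeds each chunk and writes it to PGVector):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each chunk would then be embedded and inserted into a PGVector column for semantic search.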
Introduction
As my latest professional venture at Hobes Tech, I was responsible for architecting and building a production-grade RAG chatbot for an education consultancy. The project was a trial by fire that tested my resourcefulness: fitting a high-performance ingestion pipeline onto a constrained VPS and maintaining reliability despite model provider failures.
Purpose and Goal
In study-abroad consulting, trust is the primary currency; counselors bear responsibility for a student's entire future. We couldn't afford hallucinations. My goal was to ground every response in verifiable facts by implementing a robust citation system where every claim includes a clickable link directly to the source document.
Spotlight
- Citation Engine: To solve the attribution problem, I implemented a post-processing hybrid correction algorithm inspired by recent research (e.g., Enhancing RAG Accuracy Through Post-Processing Citation Correction). By combining semantic similarity with keyword matching, the system dynamically drops or reassigns citations to ensure the AI isn't just guessing where it found the info.
- Performance Optimization: Users expect instant replies, but a workflow involving an orchestrator, retriever, reranker, and currency localizer can get heavy. I optimized the Time to First Token (TTFT) to under one second on average by:
  - Adding a path for FAQs in the workflow that entirely skips the retrieval and reranking route.
  - Moving heavy ingestion tasks to background thread pools.
  - Using async I/O for all network-bound operations.
  - Implementing connection pooling for the database.
  - Deploying via Gunicorn with multiple workers to handle concurrent traffic.
- Currency Localization: To improve the user experience, I implemented a mechanism that detects any mention of currency or monetary amounts in the chatbot's response and converts it on the fly to Nepali rupees with proper formatting.
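The core of such a localizer is a regex pass plus Nepali-style digit grouping (lakh/crore: the last three digits, then pairs). A minimal sketch, assuming a fixed illustrative exchange rate and handling only $-denominated amounts; the production version would use live rates and cover more currency patterns:

```python
import re

USD_TO_NPR = 133.0  # illustrative rate; production would fetch a live rate

def format_npr(amount: float) -> str:
    """Group digits in the Nepali lakh/crore style: 2660000 -> NPR 26,60,000."""
    whole = str(int(round(amount)))
    if len(whole) <= 3:
        return "NPR " + whole
    head, tail = whole[:-3], whole[-3:]
    parts = []
    while len(head) > 2:
        parts.insert(0, head[-2:])
        head = head[:-2]
    if head:
        parts.insert(0, head)
    return "NPR " + ",".join(parts + [tail])

def localize_currency(text: str, rate: float = USD_TO_NPR) -> str:
    """Replace $-denominated amounts in text with formatted NPR equivalents."""
    def repl(match: re.Match) -> str:
        usd = float(match.group(1).replace(",", ""))
        return format_npr(usd * rate)
    return re.sub(r"\$\s?([\d,]+(?:\.\d+)?)", repl, text)
```

For example, `localize_currency("Tuition is $20,000 per year")` rewrites the dollar figure as `NPR 26,60,000` at the assumed rate.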
Current Status
It's in production and is being used regularly by students.
Lessons Learned
This project taught me the production gap — the difference between a demo and a system that survives real-world constraints. I learned the trade-offs between agentic vs. deterministic workflows and how to build failure-tolerant systems when upstream APIs are unreliable.