The first wave of AI chatbots focused on response generation. The second wave must focus on reliability. At Framework Friday, we build agentic systems that don’t just talk - they trace, evaluate, and improve themselves.
In this post, we share how our Tiger Team built a production-ready chatbot stack using self-hosted n8n for orchestration and LangSmith for evaluation and observability. It’s modular, private, and performance-verified - built for operators who care about what’s actually happening under the hood.
Today’s LLM apps are complex, multi-step systems. Without full evaluation, you risk shipping failures you can’t see, explain, or reproduce.
That’s why we built evaluation into the system from the start - not as an afterthought.
Our chatbot system is structured around five core layers:
- Deployment: self-hosted n8n running in Docker
- Orchestration: n8n workflows driving each agent run
- Knowledge: Google Drive ingestion into a Supabase vector store
- Agent: LangChain’s Tools Agent handling reasoning and tool calls
- Evaluation: LangSmith tracing, scoring, and cost observability
This modular setup gives us control, visibility, and performance - without relying on cloud-native SaaS stacks.
We used Docker Desktop to deploy n8n locally, exposing port 5678 and mapping volumes for persistence.
Key environment variables connected the system to LangSmith:
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=your_key
LANGCHAIN_PROJECT=chatbot-evaluation
This gave us a GUI-accessible flow builder with built-in trace logging for every agent run.
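Wired together, the deployment looks roughly like the compose file below. This is a sketch, not our exact config: the service name, volume name, and image tag are illustrative, while the port, data path, and LangSmith variables match what’s described above.

```yaml
# docker-compose.yml (sketch; names are illustrative)
services:
  n8n:
    image: n8nio/n8n
    ports:
      - "5678:5678"            # n8n editor UI
    volumes:
      - n8n_data:/home/node/.n8n   # persist workflows and credentials
    environment:
      - LANGCHAIN_TRACING_V2=true
      - LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
      - LANGCHAIN_API_KEY=your_key
      - LANGCHAIN_PROJECT=chatbot-evaluation
volumes:
  n8n_data:
```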
Knowledge ingestion followed this pipeline:
Google Drive → Data Loader → Chunking → Embeddings → Supabase Vector Store
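The chunking stage of that pipeline can be sketched in plain Python. The chunk size and overlap values here are illustrative defaults, not the settings from our pipeline:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap,
    so context at chunk boundaries isn't lost between chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "x" * 1200
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print(len(chunks))      # 3
print(len(chunks[0]))   # 500
```

Each chunk is then embedded and written to the Supabase vector store; the overlap is what keeps sentences that straddle a boundary retrievable from either side.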
Optimization decisions:
This setup gave us high-relevance context without noise.
Using LangChain’s Tools Agent, we configured:
Each interaction logged tool use, response metadata, and agent reasoning paths.
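One way to structure a per-interaction record like that is shown below. This is our own sketch of the idea, not a LangSmith or LangChain schema; all field names are assumptions:

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class AgentRunLog:
    """Structured record for one agent interaction (hypothetical schema)."""
    run_id: str
    tools_used: list[str] = field(default_factory=list)
    reasoning_steps: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def log_tool(self, name: str) -> None:
        self.tools_used.append(name)

    def log_step(self, thought: str) -> None:
        self.reasoning_steps.append(thought)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


run = AgentRunLog(run_id="demo-001", metadata={"model": "gpt-4"})
run.log_tool("vector_search")
run.log_step("Retrieved 3 chunks; composing answer")
print(run.to_json())
```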
LangSmith added full observability:
LangSmith didn’t just show us what happened - it showed why and at what cost.
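The "at what cost" part comes down to token accounting per traced run. A minimal sketch, using placeholder per-1K-token prices rather than any real current rates:

```python
# Hypothetical per-1K-token prices; real rates vary by model and date.
PRICES = {"gpt-4": {"prompt": 0.03, "completion": 0.06}}


def run_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one traced run, from its prompt/completion token counts."""
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]


print(round(run_cost("gpt-4", 1200, 300), 4))  # 0.054
```

Summing this over the runs in a trace is what turns "the bot answered" into "the bot answered, and here is what that answer cost."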
This wasn’t just a chatbot project. It became a blueprint for agentic evaluation.
Key wins:
Instead of shipping another black-box tool, we built an agentic layer we can trust.
Every chat flow now generates structured, reviewable logs. We don’t rely on anecdotal feedback. We trace and score everything - from the first message to the final token.
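Scoring doesn’t have to start sophisticated. A simple, reviewable scorer, here a hypothetical keyword-coverage check, is enough to turn anecdotes into numbers attached to every traced run:

```python
def keyword_score(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords present in a response: a crude but
    transparent score to attach to each traced run (illustrative, not
    our production metric)."""
    resp = response.lower()
    if not expected_keywords:
        return 0.0
    hits = sum(1 for kw in expected_keywords if kw.lower() in resp)
    return hits / len(expected_keywords)


score = keyword_score("Supabase stores the embeddings", ["supabase", "embeddings", "n8n"])
print(round(score, 2))  # 0.67
```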
Evaluation is not just about performance. It’s about confidence, governance, and repeatability.
This system proves that self-hosted, evaluation-first AI is not only possible - it’s practical.
By combining n8n’s flexibility with LangSmith’s evaluation backbone, we turned an experimental chatbot into an operator-ready system.
Coming next: this full setup - workflows, config, and prompt logic - will be available as a Framework Friday template.
👉 Join the community at allinonai.frameworkfriday.com/c/join-now/
Let’s build agentic systems you can measure, trace, and trust.