Building LLM workflows without evaluation is like flying blind. Inputs go in. Responses come out. But what happened in between? Was it helpful? Was it right? At Framework Friday, we don’t guess. We trace, score, and improve everything we build. That’s why LangSmith is the core observability layer in every agent system we ship. It’s not just for developers; it’s how BI, product, and operations teams track what matters: output quality, latency, cost, and performance over time.
This post explains why we chose LangSmith, what it enables, and how it fits into a real, repeatable evaluation system.
Modern LLM workflows are complex. Agents make decisions. Tools run logic. Memory changes outcomes. Without observability, you can’t trust performance. You can’t explain failure cases. You can’t optimize prompts or tools. You lose time debugging what you can’t see.
LangSmith changes that. It turns every agent run into a traceable, testable, reviewable event with metrics to prove what’s working.
LangSmith is the backbone of our observability loop. It connects inputs to outputs with full traceability, exposing the hidden logic, edge cases, and failure points inside even the most complex LLM workflows. While it excels with agentic systems, LangSmith’s core strengths (tracing, evaluation, and performance monitoring) apply to any LLM use case, from basic prompt chains to advanced RAG systems and multi-step agents.
We pair LangSmith with self-hosted n8n to orchestrate workflows, GPT-4 and Claude for generation, and Supabase as our vector store for high-precision retrieval.
Every run, from tool invocation to LLM output, is logged and scored in real time. LangSmith’s dashboards give engineering, BI, and product a shared, transparent view of system behavior.
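Here is roughly what that instrumentation looks like in code. This is a minimal sketch, assuming the LangSmith Python SDK and the OpenAI Python client; the function name, model choice, and environment setup are our own, not LangSmith requirements.

```python
# Minimal tracing sketch (LangSmith Python SDK).
# Assumes LANGSMITH_API_KEY and LANGSMITH_TRACING=true are set in the environment
# (older SDK versions use the LANGCHAIN_* equivalents).
import openai
from langsmith import traceable
from langsmith.wrappers import wrap_openai

# Wrapping the OpenAI client logs every completion call as a child run.
client = wrap_openai(openai.OpenAI())

@traceable(run_type="chain", name="answer_question")  # name is our choice
def answer_question(question: str) -> str:
    # Tool calls and retrieval steps invoked here are nested under this trace.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(answer_question("What does LangSmith trace?"))
```

Once the decorator and the wrapped client are in place, model calls, latency, and token usage show up in the trace without extra logging code.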
Before LangSmith, we relied on logs and hunches. Prompt changes weren’t tracked. Evaluation was inconsistent and slow. Stakeholders had no visibility. Now, every agent run is logged with full context: what was asked, how it responded, what tools it used, how long it took, how much it cost, and how helpful the answer actually was.
We track every prompt version. We score every output. We tag every run. Debugging takes minutes instead of hours. Token burn is under control. And edge cases get flagged before they hit production. LangSmith gives us observability that scales.
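As a rough illustration of how we tag runs, record prompt versions, and attach scores, here is a sketch against the LangSmith Python SDK. The tag names, the prompt_version metadata key, the feedback key, and the run ID placeholder are our own conventions, not fixed APIs.

```python
# Sketch: attaching context and scores to runs with the LangSmith Python SDK.
from langsmith import Client, traceable

client = Client()

@traceable(
    name="support_agent",
    tags=["production", "support"],        # tags make runs filterable in the dashboard
    metadata={"prompt_version": "v12"},    # we track prompt versions as run metadata
)
def support_agent(question: str) -> str:
    # Agent logic (LLM calls, tool calls) would run here and be traced as child runs.
    return f"(answer to: {question})"

support_agent("How do I reset my password?")

# Later, a reviewer or an automated judge records a score against a specific run:
client.create_feedback(
    run_id="<run-uuid-from-the-trace>",    # placeholder; copied from the logged run
    key="helpfulness",
    score=0.8,
    comment="Answered the question but missed the pricing caveat.",
)
```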
We use a mix of automatic and human-in-the-loop evaluation. LangSmith’s “LLM-as-Judge” approach lets us use GPT or Claude to score responses against rubrics like correctness, helpfulness, and tone. We also run pairwise comparisons to test new prompts against old ones. When LLM scoring falls short, such as for tone or clarity, we queue traces for human review with LangSmith’s built-in annotation tools.
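Below is a hedged sketch of the LLM-as-Judge flow using LangSmith’s evaluate helper with a custom judge function. The dataset name, rubric wording, output keys, and the stand-in agent are illustrative assumptions rather than fixed requirements.

```python
# Sketch: LLM-as-Judge evaluation with the LangSmith Python SDK and an OpenAI judge.
import openai
from langsmith.evaluation import evaluate

judge = openai.OpenAI()

def correctness_judge(run, example) -> dict:
    """Ask a judge model to score the agent's answer against the reference."""
    prediction = run.outputs["output"]        # key matches what our target returns
    reference = example.outputs["expected"]   # key depends on how the dataset is built
    verdict = judge.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{
            "role": "user",
            "content": (
                "Score 1 if the answer matches the reference, otherwise 0.\n"
                f"Answer: {prediction}\nReference: {reference}\nReply with just 0 or 1."
            ),
        }],
    ).choices[0].message.content.strip()
    return {"key": "correctness", "score": float(verdict == "1")}

def agent(inputs: dict) -> dict:
    # Stand-in for the real agent; evaluate() calls this once per dataset example.
    return {"output": f"(answer to: {inputs.get('question', '')})"}

results = evaluate(
    agent,
    data="support-agent-eval",      # a LangSmith dataset name (ours, illustrative)
    evaluators=[correctness_judge],
    experiment_prefix="prompt-v12",
)
```

For pairwise tests, we run the same dataset through the old and new prompt as separate experiments and compare the results side by side in the LangSmith UI.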
This evaluation system helps us improve prompts, catch regressions, and ship better agents week after week.
We tested other approaches, like building evaluation workflows in n8n or writing custom scoring scripts. They worked for small tests but didn’t scale.
LangSmith gave us all of that out of the box: end-to-end tracing, prompt version tracking, built-in scoring, and dashboards the whole team can read.
And unlike other observability platforms, LangSmith is built specifically for LLMs, not just generic ML.
Get our full LangSmith template pack: evaluation configs, scoring prompts, and workflow exports → allinonai.frameworkfriday.com/c/join-now
Let’s build agentic systems you can measure, trace, and trust.