ai-agents · Signal 75/100

AI intel digest

Playground in Prod: Optimising Agents in Production Environments — Samuel Colvin, Pydantic

Samuel Colvin, creator of Pydantic, demonstrated a practical workflow for optimizing AI agents in production without redeployment.

2026-05-07 · 32 min read · 6,405 words · 10 facts · 0 assumptions
Start here

Executive summary

Samuel Colvin, creator of Pydantic, demonstrated a practical workflow for optimizing AI agents in production without redeployment. Using a real-world case study of analyzing political dynasties from Wikipedia data, he showed how Pydantic AI's structured output capabilities combined with Logfire's managed variables and GEPA (Genetic Algorithm + Pareto Frontier) optimization can improve prompt performance from 85% to 96.7% accuracy. The core innovation is the ability to change prompts, models, and parameters in production environments dynamically, moving from manual tuning to continuous optimization based on real production traces and feedback signals.

Sources mentioned

Pydantic - Open-source validation library; maintained by the Pydantic company
Pydantic AI - Agent framework built by the Pydantic company
Logfire - Observability platform by Pydantic; supports managed variables, evals, and prompt management
GEPA (Genetic Algorithm + Pareto Frontier) - Optimization library created by Lakshai, a first-year PhD student at Berkeley; used for prompt optimization
DSPy - Another framework with optimization capabilities; described as "hideous to use because it's not type safe" but with high-caliber users
Shopify - Company cited as using GEPA optimization; reported cost reduction from $5 million/year to $60,000-$73,000/year
Opus 4.6 - Anthropic model used to generate the golden dataset for the political dynasty analysis
GPT-4.1 - OpenAI model used in the workshop for speed; described as "flying because not many people are using it"
"The Rest is Politics" - UK podcast that inspired the political dynasty case study
Pydantic Gateway - AI gateway service allowing single-API-key access to multiple model providers (Anthropic, OpenAI, Grok, Gemini) with observability, caching, and fallback
Beautiful Soup - Python library used for HTML text extraction in the Wikipedia scraping pipeline
FastAPI - Web framework used to demonstrate managed variables in production

Verdict

This video is worth watching for practitioners building production agent systems, particularly those evaluating Pydantic AI or considering how to operationalize prompt optimization. The unique signal is the concrete demonstration of a full pipeline, from static prompt to managed variables to automated GEPA optimization, with measurable accuracy gains (85% → 96.7%) and real cost-reduction data (the Shopify example). Colvin's candor about the limitations ("relatively crude" optimization, model-specific prompts, the private-data problem) adds credibility absent from most vendor presentations. The video is less valuable for those seeking theoretical advances or working exclusively with state-of-the-art models on public data, where Colvin acknowledges optimization matters less. The core differentiator is the managed-variables infrastructure enabling production changes without redeployment, a practical operational capability most agent frameworks lack.

Count: Facts: 10 · Assumptions: 2 (98% private-data statistic; Shopify cost figures as reported) · Demonstrations: 4 (live eval run, comparison view, GEPA optimization, managed variables in FastAPI)

Signal density: 75% (high ratio of concrete demonstrations, specific numbers, and acknowledged limitations relative to marketing content; time spent on workshop setup and audience Q&A reduces density slightly)

What matters

Signal points

  1. Managed variables allow changing prompts, models, and parameters in production without redeployment or service restart; this is the core infrastructure shift from static to dynamic agent configuration.

  2. GEPA optimization improved accuracy from 92% (expert human prompt) to 96.7% (optimized prompt) on the political dynasty task, demonstrating that automated optimization can outperform manual prompt engineering.

  3. Shopify's reported cost reduction ($5M to ~$67K annually) using GEPA with a cheaper model shows the economic case for prompt optimization is not marginal; it is potentially transformative.

  4. The "golden dataset" problem: without reliable ground truth, evals become "lunatics running the asylum" (LLM-as-judge); deterministic, domain-specific evaluators are preferred but harder to build.

  5. Prompt optimization is model-specific and must be re-run when models change; this creates operational friction that most teams currently avoid through "vibes-based" deployment.

  6. Private data is where prompt optimization creates real value; public benchmarks are limited because state-of-the-art models may already know the answers, so enterprise internal data is the actual battlefield.

  7. The future direction is optimizing across the full agent configuration space (model, compaction strategy, tool registration, code mode), not just prompts; a sketch of that configuration space follows this list.

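The configuration-space sketch referenced in point 7: a Pydantic model is one natural way to express it. The talk names the dimensions (model, compaction strategy, tool registration, code mode) but no schema, so every field and option below is an illustrative assumption.

```python
from typing import Literal

from pydantic import BaseModel, Field


class AgentConfig(BaseModel):
    """Hypothetical search space for whole-agent optimization.

    Every field and option here is an illustrative assumption; only the
    dimensions themselves (model, compaction, tools, code mode) come
    from the talk.
    """

    system_prompt: str
    model: str = "openai:gpt-4.1"  # assumed identifier format
    compaction_strategy: Literal["none", "truncate", "summarize"] = "none"
    registered_tools: list[str] = Field(default_factory=list)
    code_mode: bool = False  # let the agent write code instead of calling tools


# An optimizer would then search over serialized AgentConfig instances
# rather than over bare prompt strings.
```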

Interpretation

Key ideas

1. AI observability is a feature, not a category, and will eventually be absorbed by general observability or AI platforms

Why: Colvin argues that logs, metrics, and traces are fundamental infrastructure, and "AI observability" is just a current marketing angle that will dissolve as the technology matures

Implication: Companies building AI-specific observability tools need a path to general platform capabilities or risk being commoditized

2. Prompt optimization matters most when dealing with private data that models haven't been trained on

Why: Colvin notes that 98% of data is private (or even 50%), and when models encounter novel internal specifications (like bank procedures), the right context in prompts becomes critical; public data demos are inherently limited because models may already know the answers

Implication: The real value of prompt optimization frameworks will be realized in enterprise settings with proprietary data, not public benchmarks

3. Genetic algorithm optimization for prompts is "relatively crude" compared to model complexity but effective

Why: Colvin explicitly states the technique is not groundbreaking—it's essentially "ask an agent to generate a new prompt, if it does better, take bits of that"—yet it achieves measurable improvements

Implication: Practical AI optimization doesn't require theoretical breakthroughs; iterative, evolutionary approaches can deliver production value
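The loop Colvin describes can be sketched generically. This is a caricature of the propose/score/keep-if-better idea, not GEPA's actual algorithm or API; propose_rewrite and score are hypothetical stand-ins for an LLM rewrite call and an eval run.

```python
from typing import Callable


def optimize_prompt(
    seed_prompt: str,
    propose_rewrite: Callable[[str], str],  # stand-in for an LLM rewrite call
    score: Callable[[str], float],          # stand-in for an eval run, 0..1
    rounds: int = 20,
) -> str:
    """Hill-climbing caricature of evolutionary prompt optimization.

    GEPA is more sophisticated (it maintains a Pareto frontier of
    candidates and recombines them), but the core loop is exactly this:
    propose, score, keep if better.
    """
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        candidate = propose_rewrite(best_prompt)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt
```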

4. Evals require a "golden" reference or deterministic judgment criteria, and defining "right" becomes harder as models get more capable

Why: The speaker discusses how human annotation, code execution, or feedback loops can serve as eval baselines, but acknowledges edge cases like smoking cessation where the true eval (death rate) is impossible to measure directly

Implication: The eval design problem is becoming the bottleneck for agent improvement, not model capability or prompt engineering
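A deterministic evaluator against a golden dataset, the baseline contrasted here with LLM-as-judge, reduces to exact-match scoring. A minimal sketch, with assumed data shapes:

```python
from typing import Callable

from pydantic import BaseModel


class GoldenCase(BaseModel):
    """One labelled example; the fields are assumed shapes."""

    input_text: str
    expected: bool  # e.g. "does this MP have a politician ancestor?"


def accuracy(cases: list[GoldenCase], predict: Callable[[str], bool]) -> float:
    """Deterministic eval: exact match against the golden label.

    `predict` stands in for the agent under test. No LLM-as-judge is
    involved, so the score is reproducible for a fixed set of outputs.
    """
    hits = sum(predict(case.input_text) == case.expected for case in cases)
    return hits / len(cases)
```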

5. Prompt optimization is model-specific and must be re-run when changing models

Why: Multiple audience questions confirm this, and Colvin agrees it's a major practical challenge—"that's one of the reasons that most people don't run evals"

Implication: Organizations need automated, continuous optimization pipelines rather than one-time prompt tuning, because model versions change frequently

6. Variance reduction in evals requires running cases multiple times, which hedge funds do at ~$20,000/night

Why: Colvin mentions this as current practice for reducing noise in performance measurement

Implication: Production agent optimization is becoming a compute-intensive, ongoing operational cost rather than a one-time setup task
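The variance-reduction arithmetic is simple even though the compute bill isn't: repeat the eval and report a mean with a standard error. A minimal sketch (the sampling scheme is an assumption):

```python
import statistics
from typing import Callable


def repeated_score(score_once: Callable[[], float], runs: int = 5) -> tuple[float, float]:
    """Run a stochastic eval several times; report mean and standard error.

    `score_once` stands in for one full pass over the eval suite. The
    standard error shrinks with sqrt(runs), which is why driving down
    noise gets expensive quickly.
    """
    scores = [score_once() for _ in range(runs)]
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / (runs ** 0.5) if runs > 1 else float("inf")
    return mean, stderr
```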

Evidence

Key facts

Samuel Colvin is the creator of Pydantic and now runs Pydantic the company, which maintains Pydantic validation, Pydantic AI (agent framework), and Logfire (observability platform)

Confidence: HIGH

Evidence: I'm probably best known as the creator of Pydantic, the open source library. Now I run Pydantic, the company. We do Pydantic validation, Pydantic AI the agent framework, and then we have Pydantic Logfire our observability platform

Logfire is built on OpenTelemetry and handles logs, metrics, and traces, but is currently marketed as AI observability because that is what customers want

Confidence: HIGH

Evidence: Logfire is fundamentally under the hood a general observability platform open telemetry logs metrics traces I don't really believe in AI observability I think it's a feature not a category... today we sell is AI observability because that is understandably the thing that people want

GEPA stands for Genetic Algorithm + Pareto Frontier and is an optimization library that optimizes strings (prompts) by breeding the best performing candidates

Confidence: HIGH

Evidence: The name comes from genetic pareto. So it's a genetic algorithm... it looks for the best value of some kind... the pareto bit comes from it basically takes candidate from the kind of pareto frontier of the best examples it has
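The Pareto-frontier selection described in this evidence, keeping any candidate that is undominated across per-case scores rather than only the best average, can be illustrated directly. This is a sketch of the selection idea only, not GEPA's code:

```python
def pareto_frontier(candidates: dict[str, list[float]]) -> set[str]:
    """Keep every prompt that is not dominated on per-case scores.

    `candidates` maps a prompt to its score on each eval case. A prompt
    is dominated if another is at least as good on every case and
    strictly better on one; the survivors are the pool bred from.
    """

    def dominates(a: list[float], b: list[float]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return {
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }
```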

Managed variables in Logfire can be any object definable with a Pydantic model, not just text prompts

Confidence: HIGH

Evidence: We take prompt management one step further and we have managed variables. So they don't have to be just text. They can be effectively any object that you can define with a Pydantic model can be managed inside inside Logfire
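Because a managed variable can be any Pydantic-modellable object, the natural shape is a structured config rather than a bare prompt string. The model below is an illustrative assumption, and fetch_variable is a hypothetical stand-in; the talk summary does not show Logfire's actual retrieval API.

```python
from pydantic import BaseModel


class PromptConfig(BaseModel):
    """Hypothetical managed variable: a structured object, not a bare string."""

    system_prompt: str
    temperature: float = 0.0
    model: str = "openai:gpt-4.1"


def fetch_variable(name: str) -> PromptConfig:
    """Hypothetical stand-in for Logfire's managed-variable lookup.

    In production the value would come from Logfire so it can change
    without a redeploy; it is hard-coded here only to keep the sketch
    self-contained.
    """
    return PromptConfig(
        system_prompt="Decide whether this MP has an ancestor who was a politician.",
    )
```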

The political dynasty analysis case study found that 24% of UK MPs have ancestors who were politicians

Confidence: MEDIUM (speaker says "I think")

Evidence: I think it might have been 24% of MPs have some kind of ancestor who is a politician

The initial simple prompt achieved 85% accuracy on the golden dataset, the expert prompt achieved 92%, and GEPA optimization achieved 96.7%

Confidence: HIGH

Evidence: overall performance of 85%... accuracy being kind of the most important one here slightly higher 92% versus 87%... It has achieved a performance score of 96.7%

Shopify used GEPA optimization to reduce costs from $5 million/year to $60,000-$73,000/year while improving performance

Confidence: MEDIUM (hearsay/example)

Evidence: There was an example from Shopify using Gepper... they got the price down from $5 million a year to $60-$73,000 a year. And improved performance over time


The golden dataset for the political dynasty analysis was generated using Opus 4.6 and manually checked

Confidence: HIGH

Evidence: Truth is we ran it we just ran a similar script with the with the like with Opus 4.6 to get that data. But I've I've checked it quite a lot and it appears to be pretty much correct

GEPA was created by Lakshai, a first-year PhD student at Berkeley

Confidence: HIGH

Evidence: Lakshai who created it is a first year PhD student at Berkeley

The workshop included a live demonstration of managed variables working in a FastAPI web server without redeployment

Confidence: HIGH

Evidence: "57:01 Demonstrating managed variables in a FastAPI web server" (timestamp in description) and discussion of changing prompts in production without restarting services
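The pattern demonstrated, resolving the variable per request rather than at import time so an update needs no restart, looks roughly like the sketch below. fetch_variable is again a hypothetical stand-in for the managed-variable lookup, not Logfire's documented API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PromptConfig(BaseModel):
    system_prompt: str


def fetch_variable(name: str) -> PromptConfig:
    # Hypothetical stand-in for the managed-variable lookup; in production
    # this would return the latest value stored in the observability platform.
    return PromptConfig(
        system_prompt="Decide whether this MP has an ancestor who was a politician.",
    )


@app.post("/analyse")
async def analyse(text: str) -> dict:
    # Resolve the variable on every request, not at import time, so a
    # prompt edited in Logfire takes effect without restarting the server.
    config = fetch_variable("dynasty-prompt")
    return {"prompt_in_use": config.system_prompt, "received_chars": len(text)}
```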

Memorable lines

Quotes

I don't really believe in AI observability I think it's a feature not a category and it will get eaten by either observability or AI at some point but today we sell is AI observability because that is understandably the thing that people want
People love to say that anything to do with AI is incredibly sophisticated and complicated and advanced. This optimization technique whilst the state-of-the-art is not actually that groundbreaking relative to the complexity of the model itself. It's a like relatively crude sense of like ask an agent to generate a new prompt. If it does better, take bits of that, put it into a new prompt, keep doing that
I think that in the case where you're trying to basically get a dumber model or a faster model or a cheaper model to be able to do some task, they make a lot of sense. If you go and take the state-of-the-art models are like Opus 4.6 and you ask it most questions, it will just if it has all the information it needs, it will just go and get it right for the most part
The statistic is that 98% of data is private... When you have a private set of data where the models have not been trained on it, you have some massive internal spec for how you're supposed to operate as a bank, let's say, adding the right bits of context into the system prompt or into the instructions is incredibly valuable
That's one of the reasons that eval are hard and that's one of the reasons that most people don't run evals and don't run optimization they write out a decent prompt they ask their coding agent of choice does this prompt look good if it says yes they kind of eyeball it and then they put it into prod