Running an agent team in production

The boring infrastructure problems you only meet after the demo is over.

Super Genius Labs Editorial · May 10, 2026 · 2 min read · Updated Jul 31, 2026

Model choice gets attention. The failure boundaries that keep an agent useful deserve more.

The shape of the problem

A production agent is not a chat.completions call. It is a long-running loop with:

Tool calls that mutate external systems, such as an appointment write.
Voice I/O with latency budgets and back-pressure when the network jitters.
Failure modes that are partial — the model answered correctly, but an appointment write did not land.
Observability — failures need enough structured context to diagnose without copying PHI into general-purpose logs.

You do not solve any of those with a better prompt.
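To make the partial-failure case concrete, here is a minimal sketch of a wrapper around a mutating tool call. It is illustrative only, not the worker's actual code: `write_with_reconciliation`, `ToolResult`, `WriteOutcome`, the `write` and `enqueue_for_staff` callables, and the caller-facing messages are all hypothetical names. It also assumes the write is idempotent, so retrying cannot double-book.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class WriteOutcome(Enum):
    CONFIRMED = "confirmed"
    NEEDS_RECONCILIATION = "needs_reconciliation"


@dataclass
class ToolResult:
    outcome: WriteOutcome
    attempts: int
    caller_message: str


def write_with_reconciliation(
    write: Callable[[], bool],
    enqueue_for_staff: Callable[[str], None],
    max_attempts: int = 3,
) -> ToolResult:
    """Attempt a mutating tool call and report success only if it confirmed.

    `write` (hypothetical) returns True when the external system acknowledged
    the change. `enqueue_for_staff` (hypothetical) receives an error class
    name, never caller content.
    """
    last_error = "WriteNotConfirmed"
    for attempt in range(1, max_attempts + 1):
        try:
            if write():
                return ToolResult(
                    WriteOutcome.CONFIRMED, attempt, "Your appointment is booked."
                )
            last_error = "WriteNotConfirmed"
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = type(exc).__name__

    # Attempts are exhausted: hand the case to staff and tell the caller plainly.
    enqueue_for_staff(last_error)
    return ToolResult(
        WriteOutcome.NEEDS_RECONCILIATION,
        max_attempts,
        "I could not confirm the booking. A staff member will follow up.",
    )
```

The point of the shape is that the message spoken to the caller is derived from the write outcome, not from the model's answer, so a correct-sounding reply cannot mask a write that never landed.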

What the stack actually looks like

The voice agent runs as a Python worker on Fly and connects to LiveKit for realtime sessions. It records bounded operational metrics, tool failures, and reconciliation events. Sensitive caller content is deliberately excluded from general operational alerts.
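One way to keep sensitive caller content out of operational alerts is an allowlist applied at the point where an event is built. The sketch below assumes a hypothetical set of field names (`ALLOWED_FIELDS`) and hypothetical helpers (`operational_event`, `log_event`). It shows the pattern only and does not describe the worker's real schema.

```python
import json
import logging

# Hypothetical allowlist: bounded, non-sensitive operational fields only.
ALLOWED_FIELDS = frozenset(
    {"event", "session_id", "tool", "error_class", "latency_ms", "attempt"}
)


def operational_event(**fields) -> dict:
    """Keep allowlisted keys; drop everything else instead of redacting it."""
    kept = {k: v for k, v in fields.items() if k in ALLOWED_FIELDS}
    dropped = sorted(k for k in fields if k not in ALLOWED_FIELDS)
    if dropped:
        # Record only the names of dropped fields, never their values.
        kept["dropped_fields"] = dropped
    return kept


def log_event(logger: logging.Logger, **fields) -> None:
    """Emit the filtered event as one JSON line."""
    logger.warning(json.dumps(operational_event(**fields), sort_keys=True))
```

For example, `operational_event(event="tool_failure", tool="book_appointment", transcript="...", latency_ms=812)` keeps the event name, tool, and latency and records only that a `transcript` field was dropped. An allowlist filters keys, not values, so the allowed fields should themselves be bounded: an error class name rather than a free-text error message.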

The site you're reading this on runs on Next 16 (App Router), with R3F for the chip animation, GSAP + Lenis for the scroll narrative, and Tailwind 4 with a custom chrome primitives layer. None of that matters to the agent. It matters because it is the surface that brings the agent into the room.

The expensive habits we have committed to

Fail loud. A stalled provider leg or a scheduling write path that has run out of attempts must degrade to a caller-visible fallback and staff reconciliation.
Version the behavior. Prompts and configuration live in the repository with the code.
Hand off. Clinical, billing, unsupported, and degraded flows route to human staff instead of asking the model to improvise.
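The hand-off rule can be sketched as a routing decision made in code rather than left to the model. The intent labels and the `route` function below are hypothetical. They assume some upstream step has already classified the request and determined whether the session is degraded.

```python
from enum import Enum


class Route(Enum):
    AGENT = "agent"
    STAFF = "staff"


# Hypothetical intent labels, for illustration only.
HANDOFF_INTENTS = frozenset({"clinical", "billing"})
SUPPORTED_INTENTS = frozenset({"scheduling"})


def route(intent: str, degraded: bool) -> Route:
    """Send anything clinical, billing, unsupported, or degraded to staff."""
    if degraded or intent in HANDOFF_INTENTS or intent not in SUPPORTED_INTENTS:
        return Route.STAFF
    return Route.AGENT
```

The default is the useful part of this design: an intent the system does not recognize goes to staff, so a new category of request fails toward a human rather than toward improvisation.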

The model is a commodity. The discipline is not.