Introducing Span's AI Effectiveness suite, powered by agent traces

Stack Trace Podcast

Treat Your Coding Agents Like a Production System

Span Team

Assembled co-founder and CTO John Wang talks with Stephen Poletto about why he treats coding agents like a production AI system, how a best-model mandate turned into real cost discipline, and the internal platform Assembled built to put agents in everyone's hands. An audio version of this interview is available on Spotify.

Three Takeaways

  • Treat coding agents like a production AI system. Assembled has run AI in production for years, so it applies the same rigor to its own dev tooling: evals, deliberate model choices, and real cost discipline.

  • Cost discipline is a muscle worth building early. Assembled deliberately ran the best available models on everything at first. Over several months, the team built the harder skill of matching each task to the model it actually needs.

  • The unlock is lowering activation energy, not adding tools. Assembled's internal platform runs coding agents in sandboxes anyone can reach with a login, gives them one well-scoped tool instead of a sprawl of integrations, and gates risky changes by role and code path.

Introduction

As co-founder and CTO of Assembled, John Wang has spent years building the systems that route customer support work between AI and humans, for companies like DoorDash, Stripe, and Robinhood. The Assembled team treats its coding agents the way it treats the AI it ships to customers: with evals, deliberate model choices, real cost discipline, and a platform built to make the agents easy to use.

The Same Problem, One Layer Down

Assembled's core product decides which support issues an AI agent can resolve and which need a person, and makes sure that when a high-value customer escalates, someone is available right away. Wang sees engineering heading the same way: working out which tasks go to agents, which need a human, and when a cheaper model will do.

But not every part of the codebase gets the same treatment. Core systems, the ones that keep customers' support operations running, demand high reliability and few bugs. New builds get a sandbox where engineers can experiment and run tests.

Determine the Best Models for Each Task

Assembled's early rule for its production AI was deliberate: use the best available model on everything and don't worry about cost. When you are shipping AI agents to customers, intelligence is what matters most, so reaching for the latest model was the right call at the time.

But the best model is rarely the right model for every task. As usage grew, the team realized it was spending heavily, token by token, on work that did not need the latest models. In many cases, the extra spend translated into very little gain for the overall product.

The fix was less about swapping models than about rebuilding an instinct on the team: stopping to ask whether a task really needs the latest model. It took four or five months to build that back into a team that had been trained to always reach for the best, and it pays off more as the number of model choices grows.
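One way to make that question routine is to write the task-to-model mapping down where everyone can see it. The sketch below is only an illustration: the tier names, task categories, and prices are hypothetical placeholders, not Assembled's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_tokens: float  # hypothetical price, for illustration only


# Hypothetical tiers; real names and prices depend on the provider.
FRONTIER = ModelTier("frontier", 15.0)
MID = ModelTier("mid", 3.0)
SMALL = ModelTier("small", 0.5)

# Assumed mapping from task category to the cheapest tier judged good enough.
TASK_TIERS = {
    "architecture_change": FRONTIER,
    "bug_fix": MID,
    "log_summary": SMALL,
    "rename_refactor": SMALL,
}


def pick_model(task_category: str) -> ModelTier:
    """Return the tier for a task; unknown categories fall back to the frontier tier."""
    return TASK_TIERS.get(task_category, FRONTIER)


def estimated_cost(task_category: str, tokens: int) -> float:
    """Estimated spend in USD for running `tokens` tokens on the chosen tier."""
    tier = pick_model(task_category)
    return tokens / 1_000_000 * tier.usd_per_million_tokens
```

Defaulting unknown work to the strongest tier keeps the original "best model" safety net, while every explicit entry in the table records a decision that a cheaper model is enough.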

Wang is blunt about where this goes wrong more broadly. He thinks tokenmaxxing, treating tokens burned as a proxy for adoption, is "absolutely stupid." His objection is not to using AI heavily but to the metric, which rewards the wrong thing. When the goal becomes tokens burned, you end up rewarding what he calls "slop cannons," people who ship volume instead of quality.

Give Agents the Right Tools, Not All of Them

Assembled built an internal system that runs coding agents like Claude Code and Codex in sandboxes anyone can reach with a login. The point was to lower activation energy. A non-engineer used to need hours to set up a local dev environment before writing a single line of code. Now anyone on a team can kick off an agent, which has empowered even non-engineers to become builders.

However, builders must run a review pass before they can even open a PR, and the system can check whether a change is reasonably scoped and whether it touches a heavily used code path, since Assembled already tracks how often each endpoint gets hit. The earliest version of this was a doc full of manual steps, and Wang is candid that its main effect was making it unlikely anyone shipped at all. The constraints have relaxed since, but the principle held: a builder still has to run the code and see the change do the right thing.
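A gate like this can be pictured as a small function over a few facts about the change. The following is a rough sketch under stated assumptions: the role names, the thresholds, and the `Change` fields are all hypothetical, and the interview does not describe how Assembled's checks are actually implemented.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would come from the team's own data.
MAX_FILES_FOR_NON_ENGINEER = 10
HOT_PATH_HITS_PER_DAY = 100_000


@dataclass
class Change:
    author_role: str            # assumed roles: "engineer" or "non_engineer"
    files_changed: int
    endpoint_hits_per_day: int  # traffic on the busiest endpoint the change touches
    review_pass_done: bool


def gate(change: Change) -> list[str]:
    """Return the reasons a change may not open a PR yet; an empty list means it may."""
    reasons = []
    if not change.review_pass_done:
        reasons.append("review pass not run")
    # In this sketch, scope and hot-path limits apply only to non-engineers.
    if change.author_role != "engineer":
        if change.files_changed > MAX_FILES_FOR_NON_ENGINEER:
            reasons.append("change is too broadly scoped")
        if change.endpoint_hits_per_day >= HOT_PATH_HITS_PER_DAY:
            reasons.append("touches a heavily used code path")
    return reasons
```

Returning reasons rather than a bare yes or no gives the builder something concrete to fix, which fits the goal of keeping friction low.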

Assembled also wired up an MCP server at first, then pulled it back, because connecting it to everything blew up the agent's context. The replacement is a single CLI that exposes a tight set of sub-commands, so access is prescriptive rather than sprawling. For the logging system, the agent can fetch, search, and filter logs and nothing else. Narrowing what the agent can touch has worked better than the sprawl of an MCP server wired to everything.
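To make the shape of such a CLI concrete, here is a minimal sketch using Python's standard `argparse`. The program name `agentctl`, the argument names, and the in-memory sample records are assumptions made for illustration; only the three log actions (fetch, search, filter) come from the interview.

```python
import argparse

# Stand-in log records; a real tool would call the logging system instead.
SAMPLE_LOGS = [
    {"id": "1", "level": "error", "message": "timeout calling billing"},
    {"id": "2", "level": "info", "message": "request completed"},
    {"id": "3", "level": "error", "message": "billing retry failed"},
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="agentctl")  # hypothetical CLI name
    tools = parser.add_subparsers(dest="tool", required=True)
    # The logs tool exposes exactly three actions and nothing else.
    logs = tools.add_parser("logs").add_subparsers(dest="action", required=True)
    logs.add_parser("fetch").add_argument("log_id")
    logs.add_parser("search").add_argument("text")
    logs.add_parser("filter").add_argument("--level", required=True)
    return parser


def run(argv: list[str]) -> list[dict]:
    """Parse the agent's command and return the matching log records."""
    args = build_parser().parse_args(argv)
    if args.action == "fetch":
        return [r for r in SAMPLE_LOGS if r["id"] == args.log_id]
    if args.action == "search":
        return [r for r in SAMPLE_LOGS if args.text in r["message"]]
    # Only "filter" remains, since argparse rejects any other action.
    return [r for r in SAMPLE_LOGS if r["level"] == args.level]
```

Anything outside the declared sub-commands, such as a delete action, is rejected by the parser before any code runs, so the allowed surface is visible in one place.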

Because the platform is shared, automations stop being a one-person thing: the whole team can see and adjust them, which is what makes them worth building. The one Wang is most excited about is still in progress: a self-improving loop where an agent reads past sessions, spots the corrections the team keeps making, and opens a PR proposing an update to AGENTS.md for the file's owner to review.
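Since that loop is still in progress, the following is only a simplified stand-in for the idea, not a description of Assembled's design. It swaps the agent's reading of sessions for plain frequency counting: corrections are assumed to be already extracted as short strings, and the function names and the section heading are hypothetical.

```python
from collections import Counter


def recurring_corrections(sessions: list[list[str]], min_sessions: int = 3) -> list[str]:
    """Return corrections that appear in at least `min_sessions` distinct sessions."""
    counts = Counter()
    for corrections in sessions:
        # Normalize and de-duplicate so one session counts a correction once.
        counts.update({c.strip().lower() for c in corrections})
    return sorted(c for c, n in counts.items() if n >= min_sessions)


def draft_agents_md_section(corrections: list[str]) -> str:
    """Draft text for a proposed AGENTS.md addition; empty if nothing recurs."""
    if not corrections:
        return ""
    lines = ["## Recurring corrections (proposed)"] + [f"- {c}" for c in corrections]
    return "\n".join(lines) + "\n"
```

The draft would still go through a PR, so the file's owner keeps the final say over what the agents are told.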

Applying Evals to Agents

The clearest production lesson Assembled is now applying to its own agents is evals. On the product side, the team invested heavily in getting them right, because a bad eval is worse than no eval: it makes you confident in the wrong thing.

Now they are building evals for the coding agents themselves. The setup is simple on purpose: a snapshot of the codebase, a prompt, and notes for an LLM judge on what a good implementation looks like. It runs in CI whenever someone changes an AGENTS.md file, so changes to how the agents behave actually get checked. The bar Wang uses is a single question: would this PR have been shippable before 2025?
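The three ingredients and the CI trigger can be sketched in a few lines. This is an illustration under assumptions: the `EvalCase` field names, the judge prompt wording, and the idea of passing a list of changed paths are hypothetical, and the call to the LLM judge itself is left out.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class EvalCase:
    snapshot_ref: str  # assumed: a commit SHA identifying the codebase snapshot
    prompt: str        # the task given to the coding agent
    judge_notes: str   # what a good implementation looks like


def should_run_evals(changed_paths: list[str]) -> bool:
    """True when any changed file is named AGENTS.md, at any depth in the repo."""
    return any(PurePosixPath(p).name == "AGENTS.md" for p in changed_paths)


def build_judge_prompt(case: EvalCase, agent_diff: str) -> str:
    """Assemble the text an LLM judge would receive for one eval case."""
    return (
        f"Task given to the agent:\n{case.prompt}\n\n"
        f"What a good implementation looks like:\n{case.judge_notes}\n\n"
        f"Agent's diff against snapshot {case.snapshot_ref}:\n{agent_diff}\n\n"
        "Question: would this PR have been shippable before 2025?"
    )
```

A CI job would call `should_run_evals` on the files changed in a PR and, when it returns true, run each case and send the assembled prompt to the judge.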

That same focus on outcomes over activity shows up elsewhere. Hiring now screens harder for product judgment, since deciding what to build and reasoning well about trade-offs have become the scarce skills, and the team's Friday demos reward showing an eval with a before-and-after impact graph rather than a slick demo of something new.

What This Means for Leaders

The thing Wang is most excited about sounds mundane: steadily lowering the activation energy it takes to get from a Linear issue to a reviewed PR. That is the throughline of everything Assembled has built internally. Setting up CI/CD used to be the table-stakes investment for a new engineer, and Wang argues that setting up your agent system is now the same kind of investment. Agents are a new kind of teammate, and they need what a new engineer needs: low friction, the right tools, and a clear path to shipping something a human can review.

The hard parts of this shift, in Wang's telling, are not the models but the disciplines around them: knowing which work goes where, spending deliberately, measuring outcomes rather than activity, and treating agents like production software that deserves a real platform. The teams that have already run AI in production have a head start on all of it.

For more episodes of Stack Trace, subscribe to Span's YouTube channel or to the Stack Trace podcast on Spotify.

Everything you need to unlock engineering excellence
