Introducing Span's AI Effectiveness suite, powered by agent traces

Coding Agent Harness Effectiveness

Stephen Poletto

Generating code is nearly free now, yet most teams still measure engineering the same way they always have: by the rate at which code gets produced. Throughput still matters, but on its own it ignores what makes software development expensive, like how much of that output gets reworked, how much human attention it burns in review, and how often it breaks in production. What you really want to know is whether your system is accurate.

As I argued in my last post, that comes down to your harness, not the model. So the question this post answers is: how do you actually measure whether your harness is working?

The SDLC

You can think of the software development process as a series of stages:

  1. Specifying what needs to be built.

  2. Defining a technical specification to support those requirements.

  3. Breaking the technical specification down into modular units of work.

  4. Coding those modular units of work.

  5. Reviewing each modular unit for architectural adherence, correctness, and maintainability.

  6. Integrating everything into a cohesive package, and running user acceptance testing over the whole thing.

  7. Deploying - first gradually, then broadly - to verify that you have solved the customer problem well.

Of course, no software development process is actually this linear. There is often iteration throughout. You rarely know all of the specifications up front; you workshop and develop the thinking as you go.

That said, it is a useful mental model for the process of bringing software to life. 

Agents create an opportunity to automate several portions of this process. Agents already do a good job of automating code generation, and teams across the industry are starting to experiment with agents that turn product requirements into technical specs and technical specs into tickets, as well as verification systems that automatically generate and execute higher-level end-to-end or user acceptance tests.

How then do we reason about the effectiveness of our harness that sits over the top, trying to steer these agents to produce better outcomes?

What is a Harness?

First off, let's define what a harness is. A harness is deterministic software that sits on top of probabilistic models. If you have agents doing various parts of the SDLC, there's no guarantee that they'll generate the same output even when given the exact same input. Because the models are probabilistic, multiple runs under identical input conditions can produce different results.

The harness then is the set of software that creates a system of checks and balances against the probabilistic model, such that the system is unable to proceed to the next step unless certain deterministic factors are satisfied.

For example, if you want to guarantee that a specific user scenario never breaks, you run a test that checks that scenario before a pull request can be opened. When the test fails, the agent cannot open a pull request and therefore cannot progress to the next stage.
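
A minimal sketch of such a gate, assuming a hypothetical harness that runs a list of named checks before the agent is allowed to open a pull request. The names `Check`, `run_gate`, and the two example checks are illustrative assumptions, not part of any real tool:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    """A deterministic check: the same input always gives the same verdict."""
    name: str
    run: Callable[[], bool]  # returns True when the check passes


def run_gate(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check and return (may_proceed, names_of_failed_checks)."""
    failed = [check.name for check in checks if not check.run()]
    return (len(failed) == 0, failed)


# Hypothetical checks standing in for real test suites.
def checkout_scenario_passes() -> bool:
    return False  # pretend the agent's change broke this user scenario


def lint_passes() -> bool:
    return True


gate = [
    Check("checkout user scenario", checkout_scenario_passes),
    Check("lint", lint_passes),
]

may_open_pr, failures = run_gate(gate)
if may_open_pr:
    print("Gate passed: the agent may open a pull request.")
else:
    print(f"Gate blocked, failed checks: {failures}")
```

The point of the design is that the decision to proceed lives outside the model: the agent can retry as often as it likes, but only a passing gate moves the work to the next stage.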

In the human-driven SDLC, we would evaluate software engineers based on:

  • how many pull requests they were producing

  • how many modular units of value they were shipping

  • roughly how often they needed course correction, either through feedback from a tech lead or manager or because of defects they introduced in production

If an engineer was really good at getting work through, they were doing a good job.

In the new SDLC, we need to evaluate our overall harness and the overall agentic development system as a system, not just as individual productivity data points. We need to zoom out and think about what task success looks like for getting work through this entire process.

‘Shifting Left’ is Now an Essential Principle

In general, a principle that we should incorporate into this process is “shifting left.” It is far more expensive to catch an agent's mistake in production, when it is impacting our customers, than it is to catch it at code review. Similarly, it's far more expensive to burn human attention on catching mistakes in code review if the agent could have caught them itself with better guardrails before the pull request was opened. We want to preserve the quality of our product and we want to preserve the attention of our human operators. These are scarce resources that are important to defend.

In this sense, we can think of this overall process as a series of opportunities for defects to escape:

  • A defect can escape into pull requests, which require humans to correct.

  • A defect can escape into production, which impacts customer experience and requires incident resolution to handle.

How do we prevent issues from getting through these checkpoints?

  1. If an agent generates a pull request, and the pull request receives human review comments, something went wrong. Each of those comments points to context the agent was missing or a deterministic gate that should have existed earlier, so that the flawed pull request could never have been opened in the first place.

  2. If we identify elevated error rates during the rollout of a new piece of software, that too points to something missing from the agent's context, something that would have prevented the agent from opening the pull request in the first place.

  3. If the entire system generates a set of software that deviates materially from the original spec, something has gone wrong. We should be able to measure this spec drift either with user acceptance testing or an agent modeling user acceptance testing, identifying how the produced software differed from the original intention.
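
One way to put a number on spec drift, sketched under the assumption that the original spec can be expressed as a list of acceptance criteria, each of which a UAT run (human or agent) marks as passed or failed. The function name `spec_drift` and the example criteria are hypothetical, and this is only one possible definition:

```python
from typing import Dict


def spec_drift(acceptance_results: Dict[str, bool]) -> float:
    """Fraction of the spec's acceptance criteria that the built software fails.

    0.0 means the software matches every criterion; 1.0 means it matches none.
    """
    if not acceptance_results:
        raise ValueError("no acceptance criteria to evaluate")
    failed = sum(1 for passed in acceptance_results.values() if not passed)
    return failed / len(acceptance_results)


# Hypothetical UAT results for one feature: criterion -> did it pass?
results = {
    "user can add an item to the cart": True,
    "user can apply a discount code": True,
    "user sees an order confirmation email": False,
    "user can cancel within 30 minutes": True,
}

print(f"spec drift: {spec_drift(results):.2f}")  # 1 of 4 criteria failed -> 0.25
```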

With this framework in mind, we can start to brainstorm new metrics of success for how we evaluate harness effectiveness:

  • Frequency of accepted code review comments (goal: reduce)

  • Frequency of defects being detected pre-release (in CI/CD, in UAT, checking for specification drift)

  • Frequency of defects being detected post-release (incidence of elevated error rates, defect counts, incident count)

These are good counter-balancing metrics to keep in mind alongside the rate at which we ship through the system (e.g. feature throughput and feature cycle time).
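
As a rough illustration, the metrics above could be computed from per-pull-request records along these lines. The record fields, the `harness_metrics` function, and the use of pull requests per week as a stand-in for feature throughput are all assumptions made for this sketch, and the sample data is invented:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PullRequestRecord:
    accepted_review_comments: int  # human comments that led to a change
    pre_release_defects: int       # caught in CI/CD, UAT, or spec-drift checks
    post_release_defects: int      # elevated error rates, bugs, incidents


def harness_metrics(records: List[PullRequestRecord], weeks: float) -> Dict[str, float]:
    """Average escape rates per pull request, plus throughput over the period."""
    if not records or weeks <= 0:
        raise ValueError("need at least one record and a positive time window")
    n = len(records)
    return {
        "accepted_comments_per_pr": sum(r.accepted_review_comments for r in records) / n,
        "pre_release_defects_per_pr": sum(r.pre_release_defects for r in records) / n,
        "post_release_defects_per_pr": sum(r.post_release_defects for r in records) / n,
        "prs_per_week": n / weeks,
    }


# Invented sample: four agent-authored pull requests merged over two weeks.
sample = [
    PullRequestRecord(2, 1, 0),
    PullRequestRecord(0, 0, 0),
    PullRequestRecord(1, 0, 1),
    PullRequestRecord(1, 1, 0),
]

metrics = harness_metrics(sample, weeks=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Reporting the escape rates next to throughput keeps either number from being optimized in isolation.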

Conclusion

The real move here is to stop thinking about productivity at the level of individual output and start measuring performance at the level of the whole system. Every human review comment and every production incident is a signal that your harness let something slip through that it could have caught upstream. Track those escapes, drive them down, and balance them against throughput, and a process that feels unpredictable starts getting measurably more reliable over time.

That's the payoff: not just shipping faster, but building a system that defends product quality and human attention as it does.

Everything you need to unlock engineering excellence
