Keeping Score: Why an AI Analytics Agent Has to Join the Grade to the Outcome

Jun 10, 2026

A forecaster who says seventy percent chance of rain every morning, and never once writes down whether it rained, is not a forecaster. He is a person with opinions. You cannot tell whether his seventy means anything until you line up a hundred of his forecasts against a hundred actual days. Maybe it rains seventy times and he is sharp. Maybe it rains forty times and his confidence is decoration. Until someone keeps score, there is no way to know.

Most trust scores on production data work exactly like that forecaster. The score ships, the agent acts on it, and nobody ever circles back to ask whether the number it trusted was worth trusting.

A grade is a prediction, not a verdict

When a system stamps a business metric “safe to act, grade B,” it is not stating a fact. It is making a forecast about a number. And a forecast is worth something only when its second half arrives: the part that has not happened yet, when the action plays out and the world says whether the call was right.

Philip Tetlock spent two decades proving this with human experts. The ones who got better were not the confident ones. They were the ones who kept score, logged each prediction, compared it to what actually happened, and moved their next estimate. The experts who never kept score stayed exactly as wrong as they started, and just as sure.

The industry is building agents that consume metrics and skipping the part Tetlock proved matters most. In LangChain’s State of AI Agent Engineering report, which surveyed more than 1,300 practitioners, 89 percent of organizations run some observability on their agents and 62 percent trace individual steps and tool calls. Fewer than half evaluate live production traffic at all. But look at what every one of those numbers grades. The agent. Its steps, its reasoning, its output. Almost none of it asks whether the number the agent trusted was worth trusting. The trace watches the agent. Nobody is scoring the number.

Zillow is the version of this that cost real money. Their home-pricing models kept buying at prices the market had already left behind, and the company took a 304 million dollar writedown in a single quarter, and more than 500 million dollars in all, as it shut the business down. The data was stored perfectly the whole time. Nothing joined the confident prediction to the outcome that contradicted it until the loss had already landed.

The loop is the join

Rather than treat a trust grade as a verdict that closes the question, treat it as a prediction whose second half resolves later. The grade is issued at decision time. The truth arrives when the action’s window closes.

Two verdicts resolve at that point. The decision verdict asks whether acting was correct. The inference verdict asks whether the agent read the world correctly. They combine into four outcomes. Reinforce, when a confident grade was right. Robust, when the agent got a bad read but the action survived it. Miscalibrated, when the grade and the outcome disagree. Avoid, when a number that looked safe led somewhere it should not have.
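The two verdicts and four outcomes can be sketched as a small lookup. This is a minimal illustration, not the tool's actual logic. The names `Quadrant` and `resolve_quadrant` are hypothetical, and the placement of Miscalibrated and Avoid on the two "acting was wrong" cells is an assumption, since the descriptions above do not pin them to exact verdict pairs.

```python
from enum import Enum


class Quadrant(Enum):
    REINFORCE = "reinforce"
    ROBUST = "robust"
    MISCALIBRATED = "miscalibrated"
    AVOID = "avoid"


def resolve_quadrant(decision_correct: bool, inference_correct: bool) -> Quadrant:
    """Combine the decision verdict and the inference verdict into one outcome."""
    if decision_correct and inference_correct:
        # A confident grade was right.
        return Quadrant.REINFORCE
    if decision_correct:
        # Bad read of the world, but the action survived it.
        return Quadrant.ROBUST
    if inference_correct:
        # ASSUMPTION: the read was accurate, yet acting was wrong,
        # so the grade and the outcome disagree.
        return Quadrant.MISCALIBRATED
    # ASSUMPTION: both verdicts failed; the number looked safe and was not.
    return Quadrant.AVOID


if __name__ == "__main__":
    for decision in (True, False):
        for inference in (True, False):
            print(decision, inference, resolve_quadrant(decision, inference).value)
```

If your own definitions put Miscalibrated and Avoid on different cells, only the last two branches change. The point is that both verdicts have to be recorded before any quadrant can be assigned.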

The loop closes only when the system goes back and joins the grade it issued to the outcome that resolved, then asks the uncomfortable question. How many of the numbers I graded B and safe actually turned out to be ones I should have avoided?

That join is the entire difference between logging trust and reasoning about it. A platform can store the grade and store the outcome as two flawless, separate records and never notice they describe the same decision. Keeping score is the act of joining them. A self-improving loop is not a feature you can claim by storing more data. It is the moment the score meets reality and the weighting moves.
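To make the join concrete, here is a rough sketch in plain Python. Everything in it is assumed for illustration: the record shapes `GradeRecord` and `OutcomeRecord`, the shared `decision_id` key, the metric names, and the sample data. A real platform would do the same thing in whatever store holds the two logs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GradeRecord:
    decision_id: str  # hypothetical key shared by both logs
    metric: str
    grade: str        # e.g. "A", "B", "C"


@dataclass(frozen=True)
class OutcomeRecord:
    decision_id: str
    quadrant: str     # "reinforce", "robust", "miscalibrated" or "avoid"


def join_grades_to_outcomes(grades, outcomes):
    """Pair each issued grade with its resolved outcome, if one exists yet."""
    outcome_by_id = {o.decision_id: o for o in outcomes}
    joined, unresolved = [], []
    for g in grades:
        o = outcome_by_id.get(g.decision_id)
        if o is None:
            unresolved.append(g)  # the action's window has not closed
        else:
            joined.append((g, o))
    return joined, unresolved


def avoid_rate(joined, grade: str) -> Optional[float]:
    """Share of resolved decisions at this grade that landed in 'avoid'."""
    resolved = [o for g, o in joined if g.grade == grade]
    if not resolved:
        return None  # nothing resolved yet, so there is no score to report
    return sum(o.quadrant == "avoid" for o in resolved) / len(resolved)


if __name__ == "__main__":
    grades = [
        GradeRecord("d1", "weekly_revenue", "B"),
        GradeRecord("d2", "weekly_revenue", "B"),
        GradeRecord("d3", "churn_rate", "B"),
    ]
    outcomes = [OutcomeRecord("d1", "reinforce"), OutcomeRecord("d2", "avoid")]
    joined, unresolved = join_grades_to_outcomes(grades, outcomes)
    print(avoid_rate(joined, "B"), [g.decision_id for g in unresolved])
```

The unresolved list matters as much as the rate. A grade with no outcome attached is a forecast whose second half has not arrived, and it should not count as a success by default.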

A confidence score nobody checks against what happened is not a trust signal. It is a forecast that never keeps score.

You can run the loop yourself. Keep score at thetruthlayer.dev/keep-score. Grade a metric, let the agent act on it, advance the clock, and watch the outcome resolve into one of the four quadrants. Do it a dozen times and a calibration readout builds underneath: of the metrics you graded safe, how many actually were. Watch the weight on each signal move as the scores meet outcomes. That movement is the self-improvement loop. Without it, the grade just drifts, confident and unchecked, like the forecaster who never wrote anything down.
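One way to picture the weight moving is a simple running update. This is a sketch under stated assumptions, not a description of how the tool at that URL computes anything: the function names, the learning rate of 0.1, the starting weight of 0.8, and the dozen sample outcomes are all made up for illustration.

```python
def update_weight(weight: float, resolved_safe: bool, learning_rate: float = 0.1) -> float:
    """Nudge a signal's weight toward 1.0 after a safe outcome, toward 0.0 otherwise.

    ASSUMPTION: an exponential-moving-average rule, chosen only because it is
    the simplest update that moves every time a score meets an outcome.
    """
    target = 1.0 if resolved_safe else 0.0
    return weight + learning_rate * (target - weight)


def replay(initial_weight: float, outcomes, learning_rate: float = 0.1):
    """Return the weight after each resolved outcome, starting value included."""
    history = [initial_weight]
    for resolved_safe in outcomes:
        history.append(update_weight(history[-1], resolved_safe, learning_rate))
    return history


if __name__ == "__main__":
    # A dozen hypothetical resolutions for one signal that was graded safe.
    outcomes = [True, True, False, True, False, False,
                True, False, False, False, True, False]
    history = replay(0.8, outcomes)
    print([round(w, 3) for w in history])
    print("hit rate:", sum(outcomes) / len(outcomes))
```

With no outcomes fed in, the history never gets past its first entry. That is the drifting grade: a weight that stays where it started because nothing ever told it otherwise.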

Why the agent era cannot skip it

A human analyst keeps a rough score without being asked. They remember the dashboard that burned them last quarter, and they trust it a little less this quarter. The loop runs quietly in their head.

An agent has no such memory unless the architecture gives it one. An agent acts at machine speed and forgets at machine speed. If the loop does not close on its own, the agent will act on the same over-trusted number a thousand times before anyone joins the grade to the outcome. The human got one painful lesson and adjusted. The agent gets the same lesson a thousand times and adjusts never.

So the question to ask of any analytics agent is not whether it can produce a confident answer. They all can. The question is whether it keeps score on its own confidence. The system that joins the grade to the outcome and surfaces “you acted on a B that resolved to avoid” without being asked is the only one you can let run unattended.

Time to trusted action was never a one-time grade. It is a loop that has to close. The score that keeps its own score is the one a business can actually build on.

Keep score at thetruthlayer.dev/keep-score.

Alireza Rahmani Khalili

Jun 17 (edited)

The Zillow case is the sharpest example of what happens when confidence and feedback are stored as separate records that nobody joins. The data was perfect. The loop never closed. A 304M writedown is what unchecked calibration drift costs at scale.

The human analyst vs. agent asymmetry is the most important point. A human gets burned once and adjusts. An agent acts on the same over-trusted number a thousand times and adjusts never unless the architecture explicitly closes the loop. "Can it produce a confident answer" is the wrong question. Every agent can. The right question is whether it keeps score on its own confidence.

I write about production AI systems and distributed backends. Worth a subscribe here too.
