Evaluation as you build it

Starter page — expand with a sample eval report and the harness interface.

The distinctive thing about building context through an interview is that the measurement comes for free. Every disambiguation the analyst makes (“active client means X, not Y”) is at once a context entry and a labeled eval pair. Building context is harvesting ground truth.
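To make the "one answer, two artifacts" idea concrete, here is a minimal sketch of how a single disambiguation could yield both a context entry and an eval pair. The `Disambiguation` class, its field names, and the dictionary shapes are hypothetical illustrations, not the harness's actual data model; "X" stands in for whatever definition the analyst gave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disambiguation:
    """One analyst answer captured during the interview (hypothetical shape)."""
    term: str             # the ambiguous term, e.g. "active client"
    meaning: str          # the definition the analyst chose
    question: str         # a question whose answer depends on that definition
    expected_answer: str  # the ground-truth answer under the chosen definition

    def context_entry(self) -> dict:
        # What the agent is given as context.
        return {"term": self.term, "definition": self.meaning}

    def eval_pair(self) -> dict:
        # What the agent is later graded against.
        return {"question": self.question, "expected": self.expected_answer}


# Placeholder values: "X" and "Y" mirror the example in the text above.
d = Disambiguation(
    term="active client",
    meaning="X, not Y",
    question="What does 'active client' mean?",
    expected_answer="X",
)
entry = d.context_entry()
pair = d.eval_pair()
```

Both artifacts come from the same record, so every interview answer adds to the context and to the eval set at once.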

The local eval delta

After you define a domain, run the delta to see how much the context helped:

"Run the eval delta on session-financials."

The open-source eval_harness/ runs your agent against the harvested pairs twice, once with the context and once without it, and reports the difference in accuracy. That difference is a concrete number you can show.
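The comparison can be sketched in a few lines. This is an illustration under stated assumptions, not the eval_harness/ interface: the `Agent` signature, `accuracy`, `eval_delta`, the exact-match scoring rule, and the toy agent are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical agent signature: (question, context or None) -> answer.
Agent = Callable[[str, Optional[str]], str]
Pair = Tuple[str, str]  # (question, expected answer)


def accuracy(agent: Agent, pairs: Sequence[Pair], context: Optional[str]) -> float:
    """Fraction of pairs answered correctly (case-insensitive exact match)."""
    if not pairs:
        raise ValueError("need at least one eval pair")
    correct = sum(
        1
        for question, expected in pairs
        if agent(question, context).strip().lower() == expected.strip().lower()
    )
    return correct / len(pairs)


def eval_delta(agent: Agent, pairs: Sequence[Pair], context: str) -> dict:
    """Run the same pairs with and without context and report the difference."""
    with_ctx = accuracy(agent, pairs, context)
    without_ctx = accuracy(agent, pairs, None)
    return {
        "with_context": with_ctx,
        "without_context": without_ctx,
        "delta": with_ctx - without_ctx,
    }


def toy_agent(question: str, context: Optional[str]) -> str:
    # Stand-in for a real agent: it only "knows" what the context tells it.
    if context is not None and "active client" in context:
        return "X"
    return "unknown"


pairs = [
    ("What does 'active client' mean?", "X"),
    ("What does 'churned client' mean?", "Z"),
]
report = eval_delta(toy_agent, pairs, context="active client means X, not Y")
```

Exact-match scoring is the simplest choice for a sketch; a real harness may grade answers differently.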

Format-agnostic

The harness reads ACF, dbt models and docs, or raw markdown, normalizes them, and measures the delta the same way for each, so you can evaluate context you already have, not just ACF.
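As a rough picture of what normalization might involve, the sketch below maps two of the inputs onto one common entry type. The `ContextEntry` shape and both loader functions are assumptions made for illustration, not the harness's internals. The dbt loader takes an already-parsed properties file (the `models` / `columns` / `description` keys that dbt uses); ACF is left out because its schema is not described on this page.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextEntry:
    """Common form that every input is reduced to (hypothetical)."""
    name: str
    text: str
    source: str


def from_dbt_schema(schema: dict) -> list[ContextEntry]:
    """Collect model and column descriptions from a parsed dbt properties file."""
    entries = []
    for model in schema.get("models", []):
        if model.get("description"):
            entries.append(ContextEntry(model["name"], model["description"], "dbt"))
        for col in model.get("columns", []):
            if col.get("description"):
                name = f"{model['name']}.{col['name']}"
                entries.append(ContextEntry(name, col["description"], "dbt"))
    return entries


def from_markdown(md: str) -> list[ContextEntry]:
    """Treat each heading as an entry name and the text under it as the body."""
    sections = []  # list of (heading, body lines)
    for line in md.splitlines():
        match = re.match(r"^#{1,6}\s+(.+)$", line)
        if match:
            sections.append((match.group(1).strip(), []))
        elif line.strip() and sections:
            sections[-1][1].append(line.strip())
    return [ContextEntry(h, " ".join(body), "markdown") for h, body in sections if body]


dbt_entries = from_dbt_schema({
    "models": [{
        "name": "clients",
        "description": "One row per client.",
        "columns": [{"name": "is_active", "description": "True when the client is active."}],
    }]
})
md_entries = from_markdown("# Active client\n\nMeans X, not Y.\n")
```

Once everything is a list of `ContextEntry` values, the same delta measurement can run regardless of where the context came from.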

From one-shot to continuous

The one-shot eval delta, run locally, is free. Continuous re-evaluation, drift detection, and observability across a team are the hosted product; see enterprise evaluation.

