Evidence Surfaces for Agents Outside the Prompt
TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.
What I Am Watching This Round
The last few issues already spent a lot of time on workboards, checkpoints, trace monitors, and pre-final drift checks. I did not want another issue that says “agents need state” and then adds four more benchmarks. The more specific question this round is: where does the evidence live when an agent leaves the chat box?
For workflow agents, evidence has to live in service state, audit logs, JSONL traces, and post-run artifacts. For document agents, it has to live in the file format itself, not only in a retrieval layer bolted on afterward. For agentic training, reward evidence has to survive the reward model’s own shortcut preferences before it becomes DPO data.
I screened Claw-Eval-Live, ObjectGraph, FlashRT, xmemory, SpecVQA, indirect prompt-injection work, multivector retrieval, and mechanism/data-selection papers. I kept three because their open HTML had enough method detail, figures, and numbers for a real mini explainer. I also followed the recent feedback: each section starts with a field on-ramp, dense tables are rewritten as Markdown, and the mechanism paper gets its equations rather than a vague “they intervene on neurons” summary.
Paper Notes
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Authors: Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan.
Institutions: The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; South China University of Technology; Xiamen University; The Hong Kong University of Science and Technology; Peking University; The University of Hong Kong.
Date/Venue: April 30, 2026, arXiv preprint.
Links: arXiv | HTML | project

This overview is the right entry point because it shows the benchmark as two linked systems: a demand signal pipeline and an execution/evaluation loop. The released tasks are not just prompts; they are workflows with mock services, local workspaces, traces, graders, and state-changing outcomes. The caveat is that the live signal source is still a public workflow proxy, so it may miss private enterprise tasks that never appear in marketplace or public skill data.

The heatmap is more informative than a single leaderboard row. It shows that model failures cluster by workflow family: HR, management, and multi-system business workflows remain harder than some local repair tasks. I read this as evidence that “agent ability” is not one scalar; the service surface and audit trail matter as much as the base model.

This plot puts pass rate and overall completion score in the same frame. The leading models are close, but no model crosses the 70% pass-rate line in the public release. The paper also reports 19 all-pass and 27 all-fail tasks under the public rule, which is a useful warning: pass rate should be read with completion score and task-level discrimination, not as a clean total ordering.
Quick idea: Claw-Eval-Live tries to evaluate workflow agents against what users currently want automated, while grading what the agent actually did in controlled services and workspaces.
Why it matters: agent benchmarks age quickly. A static task set can be valuable for reproducibility, but it becomes stale when the workflows people automate change, when services shift, and when models learn the quirks of the released benchmark. The paper’s answer is to separate a refreshable signal layer from a time-stamped reproducible release. That distinction is important: you want the benchmark to follow demand without making last month’s scores meaningless.
Method walkthrough:
- The current release starts from public workflow-demand signals, including the ClawHub Top-500 skills snapshot, and clusters them into stable workflow patterns.
- The authors compress 33 workflow patterns into 6 signal families, assign family weights, and expand 24 task seeds into 178 candidate tasks.
- They select 105 released tasks with discrimination and family coverage constraints. The release has 87 service-backed workflow tasks and 18 local workspace-repair tasks.
- Each task runs with fixed fixtures, controlled services or workspaces, JSONL traces, audit logs, post-run artifacts, and a grader. Deterministic checks are used when evidence is sufficient; structured LLM judging is reserved for semantic dimensions and is grounded in traces and rubrics.
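To make the grading split concrete, here is a minimal sketch of a deterministic-first grader; the JSONL trace schema, field names, and function signature are my assumptions for illustration, not the benchmark's released harness.

```python
import json
from pathlib import Path

def grade_task(trace_path: Path, expected_state: dict, service_state: dict) -> dict:
    """Hypothetical grader: deterministic evidence first, LLM judging as fallback."""
    # Replay the JSONL action trace recorded during the run.
    actions = [json.loads(line) for line in trace_path.read_text().splitlines() if line.strip()]
    tool_calls = [a for a in actions if a.get("type") == "tool_call"]

    # Check 1: did the required state changes actually land in the mock service?
    state_ok = all(service_state.get(k) == v for k, v in expected_state.items())

    # Check 2: is there trace evidence for each change, so a pre-seeded fixture
    # cannot pass without observable action by the agent?
    evidenced = all(
        any(key in json.dumps(call.get("arguments", {})) for call in tool_calls)
        for key in expected_state
    )

    if state_ok and evidenced:
        return {"pass": True, "method": "deterministic"}
    # Semantic dimensions (tone, summary quality) would go to a rubric-grounded
    # LLM judge here; it supplements, but never overrides, the checks above.
    return {"pass": False, "method": "deterministic"}
```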
Public leaderboard slice from the paper.
| Model | Pass rate | Pass count | Overall completion |
|---|---|---|---|
| Claude Opus 4.6 | 66.7% | 70 / 105 | 83.6 |
| GPT-5.4 | 63.8% | 67 / 105 | 81.7 |
| Claude Sonnet 4.6 | 61.9% | 65 / 105 | 79.9 |
| GLM-5 | 61.9% | 65 / 105 | 78.1 |
| Doubao Seed 2.0 | 43.8% | 46 / 105 | 70.4 |
The main result is not that one model wins. The main result is that the best model still fails roughly one third of the release, and the failures are structured. The efficiency table adds a deployment angle: GPT-5.4 uses 1.26M tokens, costs an estimated $6.27, and takes 104 minutes across the release, while Claude Opus 4.6 uses 3.32M tokens, costs $31.61, and takes 213 minutes. Those are estimated API costs from recorded token use and provider list prices, not total experiment spend, but they make the benchmark feel closer to an operations question.
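As a sanity check on the efficiency numbers, the implied blended price per million tokens falls straight out of the reported figures; this is back-of-envelope arithmetic on the paper's numbers, not a pricing claim.

```python
# Implied blended cost per million tokens from the release-level numbers above.
runs = {
    "GPT-5.4":         (1.26e6, 6.27),   # (total tokens, estimated USD)
    "Claude Opus 4.6": (3.32e6, 31.61),
}
for model, (tokens, usd) in runs.items():
    print(f"{model}: ${usd / (tokens / 1e6):.2f} per 1M tokens (blended in/out)")
# GPT-5.4: $4.98 per 1M tokens; Claude Opus 4.6: $9.52 per 1M tokens
```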
My judgment: I like the benchmark because it asks for evidence before it asks for prose. Service-backed workflows are evaluated through tool traces, service audit logs, fixtures, and rubrics; workspace repair is evaluated through command traces, post-run state, artifacts, and tests. The weaker point is judge dependence: GPT-5.4 is used for semantic judging and is also one of the evaluated models. The authors limit that risk by grounding judge inputs in traces, but I would still treat model-judge scores as audit-assisted estimates, not independent human adjudication.
Connection to tracked themes: data agents, workflow agents, live evaluation, auditable execution.
ObjectGraph: From Document Injection to Knowledge Traversal
Authors: Mohit Dubey; Open Gigantic.
Institutions: Open Gigantic; other affiliations not specified.
Date/Venue: April 30, 2026, arXiv preprint.
Links: arXiv | HTML

This figure captures the paper’s systems claim. For small files, the agent can read an index once and call resolve_context; for large files, a router agent selects node IDs and an executor receives only the resolved payload. The important detail is the missing shared history in Architecture B: the executor is protected from context compounding by design, not by summarization after the fact.

The token-cost plot shows why this is not only a file-format proposal. Across skill files, runbooks, execution plans, technical docs, and knowledge bases, ObjectGraph sends much less text to the model than Markdown injection or RAG. The caveat is that the benchmark documents are curated and the format is new, so the reported savings are a strong proof of direction rather than evidence of broad ecosystem adoption.

This curve is the most agent-specific result. Cumulative Markdown cost grows superlinearly over a five-turn workflow because each read is carried forward in the conversation history; ObjectGraph Architecture B stays near-linear because only the relevant payload is passed to the executor. The paper reports 46,000 cumulative tokens for Markdown at turn 5 versus 1,260 for Architecture B, a 36.5x reduction.
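A toy cost model reproduces the shape of the two curves. The document size below is the benchmark mean from the paper; the per-turn payload is a hypothetical stand-in chosen for illustration, and the real Markdown total is higher still because instructions and model outputs also accumulate in history.

```python
# Toy model of cumulative prompt tokens over a multi-turn workflow.
doc_tokens, payload_tokens, turns = 2340, 252, 5  # payload size is hypothetical

# Markdown injection: the full document rides along in history, so turn t
# re-sends it t times and the cumulative cost grows roughly quadratically.
markdown_cum = sum(doc_tokens * t for t in range(1, turns + 1))  # 35,100 tokens

# Architecture B: the executor only ever sees the resolved payload,
# so the cumulative cost stays linear in the number of turns.
og_cum = payload_tokens * turns                                  # 1,260 tokens
```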
Quick idea: ObjectGraph argues that agent documents should be traversed as typed graphs, not injected as linear text, and proposes .og as a Markdown-compatible file format with native nodes, edges, access scopes, assertions, and changelogs.
Why it matters: a lot of agent memory infrastructure treats documents as strings and then adds retrieval. That works for search, but it does not encode who should see which node, which procedure step depends on which warning, whether a task succeeded, or what changed since the last turn. If an agent repeatedly reads the same runbook in a multi-turn loop, the cost is not just tokens. The irrelevant context can dilute attention and leak information across roles.
Method walkthrough:
- ObjectGraph makes a document a typed directed graph. A node can carry summary, dense content, steps, warnings, code, assertions, and metadata; edges encode prerequisites, references, alternatives, and conditional paths.
- The format is a strict superset of Markdown: every Markdown file remains valid .og, but .og adds ::index, node IDs, content-type tags, role scopes, assertions, and changelog blocks.
- The query protocol has two primitives: search_index to find relevant nodes and resolve_context to retrieve the payload. Architecture B uses a router/executor split so the executor never receives router history.
- A Markdown-to-ObjectGraph transpiler uses deterministic structural extraction, bounded LLM metadata synthesis, and fidelity verification.
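To make the two primitives concrete, here is a minimal sketch of the query protocol; the node schema, role handling, and matching logic below are my own simplifications, not the paper's .og specification.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    summary: str                                  # cheap index entry the router can see
    content: str                                  # dense payload, loaded only on resolve
    roles: set = field(default_factory=lambda: {"*"})   # access scopes
    edges: dict = field(default_factory=dict)            # e.g. {"prerequisite": [...]}

class ObjectGraphDoc:
    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}

    def search_index(self, query: str, role: str):
        # The router sees only (id, summary) pairs, never the dense content.
        q = query.lower()
        return [(n.node_id, n.summary) for n in self.nodes.values()
                if ("*" in n.roles or role in n.roles) and q in n.summary.lower()]

    def resolve_context(self, node_ids, role: str) -> str:
        # The executor receives only the resolved payloads for the selected nodes.
        parts = [f"[{nid}] {self.nodes[nid].content}"
                 for nid in node_ids
                 if nid in self.nodes and ("*" in self.nodes[nid].roles
                                           or role in self.nodes[nid].roles)]
        return "\n\n".join(parts)
```

In the Architecture B split, the search_index output goes to the router and only the resolve_context string reaches the executor, which is how router history is kept out of the executor's context by construction.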
Task accuracy slice from the paper.
| Task type | Markdown | RAG | ObjectGraph | ObjectGraph with explicit edges |
|---|---|---|---|---|
| Information lookup | 91.2 | 87.4 | 92.1 | 92.3 |
| Procedure execution | 88.6 | 83.1 | 89.4 | 90.1 |
| Role-conditional access | 76.4 | 71.2 | 94.8 | 95.1 |
| Cross-node reasoning | 82.1 | 74.6 | 77.9 | 80.3 |
| Update detection | 61.3 | 54.7 | 91.4 | 91.6 |
| Assertion verify | 52.8 | 48.1 | 96.3 | 96.5 |
| Mean | 76.0 | 71.0 | 90.1 | 90.8 |
The table is the evidence I care about most. ObjectGraph helps most where the format adds something Markdown lacks: role-conditional access, update detection, assertion verification, and multi-agent handoff. The weak row is cross-node reasoning, where full Markdown injection still gives implicit context; explicit edges reduce but do not remove that gap.

The ablation separates the format features rather than treating .og as a single magic wrapper. Index routing and the dense layer account for most of the token savings, with skip-if-known, role scoping, and delta loading adding smaller pieces. This supports the paper’s claim that the gain comes from native document structure, not only from shorter serialization.
The evaluation uses 240 documents across five classes, eight task types, and five executions per document-task pair. Documents range from 200 to 15,000 tokens, with a mean of 2,340 and a median of 1,680. The paper reports mean token consumption dropping from 2,340 to 187 tokens, a 92.0% reduction, and also reports up to 95.3% token reduction without statistically significant accuracy degradation. Transpiler fidelity reaches a mean of 0.987 on 180 held-out documents, and a small user study with 18 participants reports authoring burden of 2.8/7.
My judgment: this is a useful paper even if .og never becomes the final standard. It identifies a real product problem: agent-readable documents need access control, dependency structure, executable assertions, and delta loading inside the file surface. I would be cautious about the standardization story and the current lack of cross-file edge resolution. The next hard test is not whether a single runbook can be converted; it is whether a team can maintain hundreds of .og files without breaking graph links or human authoring discipline.
Connection to tracked themes: document intelligence, data agents, agent memory, RAG infrastructure.
Debiasing Reward Models via Causally Motivated Inference-Time Intervention
Authors: Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida.
Institutions: Human Informatics Labs, NTT, Inc.
Date/Venue: ACL 2026 Main Conference; arXiv version April 30, 2026.
Links: arXiv | HTML

This overview shows the method in two stages: find neurons whose activations correlate with spurious style features, then replace those neuron activations with validation-set medians at inference time. The target is not only length bias; the paper studies length, paragraph count, lexical overlap, exclamation marks, and Markdown bold text. The caveat is that the biases are predefined and style-heavy, so this is not a general truthfulness detector.

The causal graph is small, but it is the technical core. The paper treats the input as the treatment, the reward as the outcome, and bias-specific neuron activations as a mediator. The intervention tries to score a pair as if the two responses had the same mediator value, so style shortcuts do not dominate the reward difference.
Quick idea: CIRM finds reward-model neurons associated with style shortcuts and clamps them to median values at inference time, estimating a controlled direct effect rather than the ordinary reward difference.
Why it matters: reward models turn preferences into training data. If the reward model likes long answers, bold text, exclamation marks, or superficial overlap, that preference can move into DPO data and eventually into the policy. Length penalties are a blunt fix; they can punish useful detail. This paper asks whether we can remove the shortcut inside the reward model while leaving the rest of the reward signal intact.
The starting point is the Bradley-Terry reward model, with \(x_i = (q, y_i)\) denoting the prompt-response pair:

\[p(y_1 \succ y_2 \mid q) = \sigma\big(r_\theta(x_1) - r_\theta(x_2)\big).\]

CIRM rewrites the comparison in causal terms. If \(m(x)\) is the activation vector of bias-specific neurons, the ordinary reward difference estimates a total effect:

\[\widehat{\mathrm{TE}} = r_\theta(x_1, m(x_1)) - r_\theta(x_2, m(x_2)).\]

The intervention instead fixes the mediator to a validation-set median \(m^*\), estimating a controlled direct effect:

\[\widehat{\mathrm{CDE}} = r_\theta(x_1, m^*) - r_\theta(x_2, m^*).\]

Method walkthrough:
- On 500 RewardBench validation instances, the method computes each reward-model neuron’s last-token activation and each predefined spurious feature value.
- For each bias, it ranks neurons by Spearman's \(\rho\) and selects the top and bottom \(k\) neurons as bias-specific. The paper tunes \(k\) over \(\{50, 100, 200, 500, 1000, 2000, 5000\}\) across five bias types.
- At inference time, it replaces those activations with median values from the validation set, editing less than 2% of all neurons in the tested reward models.
- The debiased reward model is evaluated directly on RewardBench and RM-Bench, then used to annotate preference data for DPO, with downstream evaluation on AlpacaEval 2.0, MT-Bench, and TruthfulQA.
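A minimal sketch of both stages, assuming last-token activations are collected at one hidden layer of a standard PyTorch reward model; the layer name, tensor shapes, and attachment point are my assumptions, not the paper's code.

```python
import torch
from scipy.stats import spearmanr

def select_bias_neurons(acts: torch.Tensor, feature: torch.Tensor, k: int) -> torch.Tensor:
    """acts: [n_val, n_neurons] last-token activations on validation prompts;
    feature: [n_val] values of one predefined spurious feature (e.g. length)."""
    rho = torch.tensor([spearmanr(acts[:, j].numpy(), feature.numpy()).correlation
                        for j in range(acts.shape[1])])
    top = torch.topk(rho, k).indices       # most positively correlated neurons
    bottom = torch.topk(-rho, k).indices   # most negatively correlated neurons
    return torch.cat([top, bottom])

def make_clamp_hook(neuron_idx: torch.Tensor, medians: torch.Tensor):
    """Forward hook that fixes bias-specific activations to validation medians,
    so both responses in a pair are scored under the same mediator value m*."""
    def hook(module, inputs, output):
        out = output.clone()
        out[..., neuron_idx] = medians.to(out.dtype)
        return out
    return hook

# Hypothetical attachment point; the actual layer depends on the reward model:
# handle = reward_model.backbone.layers[-1].mlp.register_forward_hook(
#     make_clamp_hook(idx, medians))
```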

The neuron histogram shows that bias-specific neurons are not uniformly spread across the model. For the GRM reward model, the top and bottom Spearman-correlated neurons for length concentrate in particular layers. This supports the paper’s mechanistic claim that style shortcuts have a localized enough footprint for targeted intervention, though localization does not prove the neurons represent only that bias.
Downstream DPO evaluation slice from the paper.
| Base policy and annotation RM | AlpacaEval 2.0 LCWR | AlpacaEval 2.0 WR | Mean length | MT-Bench |
|---|---|---|---|---|
| Llama-3-8B-Instruct, no DPO | 26.09 | 32.06 | 1968 | 7.34 |
| Llama-3-8B + GRM | 37.53 | 47.47 | 2193 | 7.45 |
| Llama-3-8B + GRM + CIRM | 41.89 | 50.13 | 2201 | 7.53 |
| Llama-3-8B + FsfairX | 37.78 | 49.74 | 2368 | 7.64 |
| Llama-3-8B + FsfairX + CIRM | 39.49 | 51.19 | 2345 | 7.62 |
| Gemma-2-9B-it + FsfairX | 55.32 | 58.56 | 1931 | 8.07 |
| Gemma-2-9B-it + FsfairX + CIRM | 57.76 | 60.52 | 1923 | 8.15 |
| Gemma-2-9B-it + INF 70B RM | 58.98 | 61.51 | 1919 | 8.11 |
The numbers are not a giant leaderboard leap, and that is partly why I like the paper. The useful result is that a small reward model with CIRM can improve both length-controlled win rate and raw win rate in several settings without the obvious over-shortening seen in length-only penalties. The comparison with INF is also practical: a 2B or 7B reward model with targeted intervention can get close to a 70B reward model as a preference annotator in these evaluations.
My judgment: CIRM is best read as reward-model hygiene for agentic training. It does not solve reward hacking, and it only covers biases that the authors define in advance. But it gives a concrete way to ask whether a reward model is training the policy on helpfulness or on formatting habits. The next question I would track is whether the same idea can be extended from style features to tool-use artifacts: citation shape, JSON verbosity, confident but unverified claims, or traces that look clean while hiding wrong actions.
Connection to tracked themes: large model mechanisms, reward models, agentic training, interpretability-guided post-training.
Reading Priority and Next Questions
I would read ObjectGraph first if the goal is product design for data/document agents. Even if the exact .og syntax changes, the file-format argument is useful immediately. I would read Claw-Eval-Live first if the goal is benchmark design or agent evaluation. CIRM is the one I would keep closest to post-training work: it gives a small, testable intervention point between reward-model shortcuts and policy updates.
Next questions I want to follow: can live workflow benchmarks expose failure evidence without becoming judge-model benchmarks, can document formats support cross-file graph links without making authoring miserable, and can reward-model interventions target deployment artifacts rather than only surface style features?