Closed Loops for Auditable Agents
Published:
TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.
What I Am Watching This Round
The last Paper Radar issue focused on verifiable state for long-horizon agents. I did not want to repeat that shape with another set of agent benchmarks and verifier papers. This time I screened recent arXiv, Hugging Face, Chinese tech-media, community, and official lab leads from April 28-30, 2026, then kept the papers where the full text and figures were accessible enough for close reading.
The strongest pattern is a move from static evaluation to closed-loop evidence. FutureWorld asks whether prediction agents can learn from events that resolve later in the real world. AgentSim asks whether RAG agents can leave behind grounded traces that people and models can validate step by step. DV-World treats data visualization as a messy professional workflow rather than a code-only plotting task. MoRFI is not an agent paper, but it matters here because post-training is the place where many agents acquire new behavior; the paper gives a mechanistic view of how fine-tuning on unknown facts can damage access to stored knowledge.
Several good leads stay on the watchlist: STARRY and X-WAM for robotic world-action modeling, GLM-5V-Turbo and NVIDIA Nemotron 3 Nano Omni for multimodal agents, BioMysteryBench for scientific data agents, and Evergreen for claim verification over semantic aggregates. I did not include them here because this issue already has four papers that form a coherent loop from reward to trace to workflow to mechanism.
Paper Notes
FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards
Authors: Zhixin Han, Yanzhi Zhang, Chuyang Wei, Maohang Gao, Xiawei Yue, Kefei Chen, Yu Zhuang, Haoxiang Guan, Jiyan He, Jian Li, Yitong Duan, Yu Shi, Mengting Hu, Shuxin Zheng
Institutions: Nankai University; Academy of Mathematics and Systems Science, Chinese Academy of Sciences; Zhongguancun Academy; Tsinghua University
Date: arXiv, April 29, 2026
Links: arXiv, open HTML, dataset link announced in paper

This figure shows the front half of FutureWorld: public event sources are turned into question-description pairs, filtered, resampled, and rewritten as probability-estimation prompts. The important design choice is that the environment starts before the answer exists, which reduces the usual leakage problem in retrospective prediction datasets. The caveat is that the quality of the environment now depends on source selection, question templates, filtering rules, and later outcome matching, not only on the model.

The training loop is the paper’s core contribution. An agent searches, reads, reasons, and emits a probability; the rollout is stored until the future event resolves; the realized outcome is then matched back to the trajectory and converted into a reward. This supports the claim that live prediction can become an agentic reinforcement-learning environment, but it also introduces delayed labels, changing distributions, and possible reward sparsity.

The checkpoint curve is the most useful empirical picture because it asks whether the loop improves prediction behavior over consecutive training days. The paper reports accuracy, Brier score, and calibration error across saved checkpoints, with bootstrap confidence bands. I would read the upward trend as promising but early: the authors say experiments are ongoing, and a live environment can change as news cycles and source pools change.
Quick idea: FutureWorld turns live future prediction into a delayed-feedback training environment. Instead of training agents on already-resolved questions, it generates current prediction questions, lets agents gather evidence and predict probabilities, then waits for real-world outcomes to assign rewards.
Why it matters: most agent training environments are either synthetic, short-horizon, or retrospectively labeled. A prediction agent is different because the world itself eventually provides the label. That makes FutureWorld a useful test case for agents that need to learn from reality without being handed immediate answers.
The method has four concrete steps. First, the system maintains a pool of public event sources, currently described as 72 candidate websites spanning multiple domains. Second, it constructs question-description pairs through rules or templates, filters low-quality items, and resamples to balance domains; the paper reports about 2,047 filtered pairs on average over seven consecutive days, from which 500 questions are retained after resampling. Third, each retained prompt becomes an agent rollout with search, reading, reasoning, and probabilistic prediction. Fourth, once the scheduled outcome is available, the system retrieves the ground truth, attaches it to the stored trajectory, computes reward, and uses accumulated rewards for parameter updates.
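To make the delayed-label mechanics concrete, here is a minimal sketch of how such a loop could be wired, assuming a daily cadence and a Brier-style reward; all names and interfaces (PendingRollout, run_agent, resolve_outcome) are illustrative stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of a delayed-reward loop in FutureWorld's style.
# All names and the reward choice are illustrative, not the paper's interfaces.
from dataclasses import dataclass

@dataclass
class PendingRollout:
    question: str          # probability-estimation prompt
    trajectory: list       # search / read / reason steps
    prediction: float      # agent's probability for the event
    resolve_date: str      # when the outcome is expected to resolve

def run_agent(question):
    """Placeholder agent: search, read, reason, emit a probability."""
    return [], 0.5

def resolve_outcome(rollout, today):
    """Placeholder resolver: returns 0/1 once the event has resolved, else None."""
    return 1 if rollout.resolve_date <= today else None

def daily_step(questions, pending, today):
    # 1) Roll out today's freshly generated questions; labels do not exist yet.
    for q, date in questions:
        traj, p = run_agent(q)
        pending.append(PendingRollout(q, traj, p, date))
    # 2) Check stored rollouts whose events have since resolved.
    rewards, still_pending = [], []
    for r in pending:
        outcome = resolve_outcome(r, today)
        if outcome is None:
            still_pending.append(r)
        else:
            # Brier-style reward: probabilities closer to the outcome score higher.
            rewards.append((r.trajectory, -(r.prediction - outcome) ** 2))
    # 3) Accumulated (trajectory, reward) pairs feed the policy update.
    return rewards, still_pending

rewards, pending = daily_step([("Will X happen by May 10?", "2026-05-10")], [], "2026-05-12")
print(rewards)
```

The point of the sketch is structural: predictions and rewards live on different days, so the environment has to store trajectories and match resolved outcomes back to them before any parameter update can happen.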
The evidence is still exploratory but concrete. The training section reports day-by-day checkpoint improvements on accuracy, Brier score, and expected calibration error. The daily benchmark also tests frontier web-search agents over four consecutive days. In the reported four-day averages, qwen/qwen3-max-thinking has the highest overall score at 39.01, with google/gemini-3.1-pro-preview at 37.70 and anthropic/claude-opus-4.6 at 37.00. In a selected-day comparison of trained and untrained smaller agents, Qwen2.5-3B-Instruct moves from 0.00 overall to 43.72 after eight days of training, while DeepSeek-R1-0528-Qwen3-8B moves from 13.57 to 34.53. Those numbers are not a final verdict, but they show that the environment is producing a usable training signal.
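For readers who want to reproduce the metric readouts, here is a small reference implementation of Brier score and expected calibration error under common definitions for binary forecasts (squared error against 0/1 outcomes; equal-width probability bins with frequency-versus-confidence gaps); the paper's exact binning and weighting may differ.

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(p, y, n_bins=10):
    """ECE: per-bin |observed frequency - mean confidence|, weighted by bin size."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        if mask.any():
            gap = abs(y[mask].mean() - p[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

preds = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 0, 1]
print(brier_score(preds, outcomes), expected_calibration_error(preds, outcomes))
```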
My judgment is that FutureWorld is worth tracking less for its current leaderboard and more for the delayed-reward protocol. It makes agent learning depend on event resolution, calibration, and outcome matching. The risk is operational: if the question generator or outcome resolver is brittle, the agent may learn artifacts of the environment rather than better forecasting. The next question I would ask is whether FutureWorld can expose uncertainty about its own labels, for example when sources conflict or an outcome is ambiguous.
Connection to tracked themes: agentic training, world-facing rewards, calibration, closed-loop evaluation.
AgentSim: A Platform for Verifiable Agent-Trace Simulation
Authors: Saber Zerhoudi, Michael Granitzer, Jelena Mitrovic
Institutions: University of Passau; Interdisciplinary Transformation University Austria
Date: arXiv, April 29, 2026; SIGIR 2026
Links: arXiv, open HTML, DOI

This figure lays out AgentSim as a trace-production system, not just a benchmark. It combines simulation over a document collection, trace recording, validation, and corpus construction. The key claim is that RAG agents need grounded intermediate steps tied to specific documents; free-form chain-of-thought is not enough for audit or training.

The second figure shows why seed selection matters. The corpus-aware policy is meant to retrieve broader, less redundant document regions across datasets rather than repeatedly sampling the same easy neighborhoods. The caveat is that breadth is measured through retrieval behavior, so it depends on the chosen encoder, corpus, and top-k retrieval setup.
Quick idea: AgentSim generates verifiable RAG-agent traces over document collections. Its central move is to make the reasoning path auditable by linking each substantive step to retrieval actions, documents, and validation signals.
Why it matters: a lot of “deep research” and RAG-agent work still evaluates only final answers. That makes it hard to tell whether an agent reached an answer through grounded evidence, lucky retrieval, or unsupported synthesis. AgentSim addresses the data problem: if we want trustworthy RAG agents, we need traces that can be checked at the step level.
The system has two main mechanisms. Corpus-Aware Seeding selects starting queries to cover different topics and document regions, using embedding clusters, novelty thresholds, and a diversity-relevance tradeoff. Active Validation uses multi-model disagreement to decide where human review is most valuable, so annotation effort is spent on ambiguous steps rather than uniformly across the trace stream. The platform also includes a visual interface for prototyping and a command-line toolkit for scalable generation.
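A rough sketch of both mechanisms, under my own simplifying assumptions: k-means clusters over document embeddings with a cosine novelty gate for seeding, and answer disagreement among analyst models as the trigger for human review. The real AgentSim policies are more involved, and none of these names come from the paper.

```python
# Sketch of corpus-aware seeding and disagreement-based validation routing.
# Assumes precomputed document embeddings; clustering and thresholds are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def select_seeds(doc_embeddings, budget, n_clusters=20, novelty_threshold=0.9):
    """Pick seed documents that cover distinct clusters and avoid near-duplicates."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_embeddings)
    chosen, chosen_vecs = [], []
    for cluster in range(n_clusters):
        for idx in np.flatnonzero(labels == cluster):
            vec = doc_embeddings[idx] / np.linalg.norm(doc_embeddings[idx])
            # Novelty gate: skip candidates too similar to already-chosen seeds.
            if chosen_vecs and max(float(vec @ v) for v in chosen_vecs) > novelty_threshold:
                continue
            chosen.append(int(idx))
            chosen_vecs.append(vec)
            break
        if len(chosen) >= budget:
            break
    return chosen

def needs_human_review(step_answers, max_distinct=2):
    """Route a trace step to human review when analyst models disagree too much."""
    return len(set(step_answers)) > max_distinct  # e.g. all three analyst models differ

rng = np.random.default_rng(0)
seeds = select_seeds(rng.normal(size=(500, 64)), budget=10, n_clusters=10)
print(seeds, needs_human_review(["A", "B", "C"]))
```

The design intuition this captures is the one the paper argues for: spend the seed budget on unexplored corpus regions and the annotation budget on steps where models cannot agree.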
The corpus numbers are substantial. The Agent-Trace Corpus contains 103,567 reasoning trace steps, 20,548 supervised training pairs, 199,968 unique retrieved documents, and 26,176 generated queries. It spans MS MARCO, Quasar-T, and CausalQA, with analyst models including gpt-4o, mistral-large, and deepseek-v3. For the seeding experiment at a budget of 500 seeds and five runs, the corpus-aware method reaches full cluster coverage across the three reported datasets, with near-zero retrieval redundancy. That supports the authors’ narrower claim: careful seed policy can make trace simulation explore more of the corpus.
The paper also reports behavioral differences across analyst models. Query reformulation is mostly conceptual for all three models, but deepseek-v3 uses syntactic reformulations much more often than mistral-large or gpt-4o. I like that analysis because it treats agent traces as behavioral data, not only training data. If agents are going to be audited, we need this kind of trace taxonomy.
My concern is that simulated traces can become too clean. Even with active validation, a generated RAG trace is not the same as a human research session with messy goals, broken pages, contradictory documents, and time pressure. Still, AgentSim is a useful piece of infrastructure because it separates grounded trace generation from final-answer scoring. For Paper Radar’s interests, it pairs naturally with document intelligence and data agents: both need evidence trails that survive review.
Connection to tracked themes: document intelligence, RAG agents, verifiable traces, data construction.
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
Authors: Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu, Jiahao Su, Sihan Wang, Yao Wang, Enrui Wang, Ye Yang, Hongze Chai, Jinming Lv, Anbang Yu, Huangjing Zhang, Yitong Zhang, Yiming Huang, Zeyao Ma, Shizhu He, Jun Zhao, Kang Liu
Institutions: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; National University of Singapore; Renmin University of China
Date: arXiv, April 28, 2026
Links: arXiv, open HTML, Hugging Face paper page

This overview is the reason I selected DV-World. The benchmark covers spreadsheet-native charting, evolution of existing visualizations across frameworks, and interactive clarification with a user simulator. That is much closer to real data work than “generate one chart from one clean table.”

This figure connects table coverage, visual quality, and repair success. The left panel suggests that better data grounding and visual aesthetics move together in chart creation, while the right panel breaks down fix categories. The caution is that these metrics mix rule-based and MLLM-judged components, so the authors’ validation of the evaluator matters.

The interaction figure is the most product-facing result. Asking the user can help, but only when the questions are effective and reduce ambiguity rather than adding friction. This is a useful correction to a common agent assumption: “ask more” is not automatically better; the agent has to know what ambiguity blocks the task.
Quick idea: DV-World evaluates data-visualization agents across the full professional lifecycle: native spreadsheet manipulation, cross-framework visualization evolution, and proactive intent alignment under ambiguous user requests.
Why it matters: data agents often look strong in code sandboxes but fail when the work involves spreadsheets, chart objects, visual inspection, user clarification, and business constraints. DV-World makes those frictions part of the benchmark. That puts it squarely in the “data agent” thread rather than a generic chart-generation paper.
The benchmark design has three parts. DV-Sheet tests chart and dashboard creation plus diagnostic repair inside native spreadsheet-like settings. DV-Evolution asks agents to adapt and restructure existing visual artifacts across Python, Apache ECharts, Vega-Lite, D3.js, and Plotly.js. DV-Interact adds a user simulator and ambiguity taxonomy, forcing agents to decide when and how to ask clarifying questions. The benchmark has 260 tasks, including 50 create, 50 fix, and 30 dashboard tasks in DV-Sheet; 80 evolution tasks; and 50 interaction tasks. The paper reports 51 chart types, DV-Sheet workbooks averaging 36.53 columns and 11,583.36 rows, and 3.17 ambiguities per DV-Interact task.
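The DV-Interact point that clarification only pays off when it resolves blocking ambiguity (see the interaction figure above) can be phrased as a simple gating policy. The sketch below is my own illustration, not DV-World's simulator protocol, and the ambiguity categories are placeholders.

```python
# Illustrative clarification gate for a visualization agent talking to a user simulator.
# Categories and the blocking test are placeholders, not DV-World's taxonomy.
BLOCKING = {"target_columns", "aggregation", "chart_type"}   # cannot plot without these
COSMETIC = {"color_palette", "title_wording", "legend_position"}

def plan_questions(detected_ambiguities, max_questions=2):
    """Ask only about ambiguities that block the task, up to a small budget."""
    blocking = [a for a in detected_ambiguities if a in BLOCKING]
    if not blocking:
        return []                      # proceed with defaults instead of adding friction
    return blocking[:max_questions]    # spend the clarification budget on what blocks progress

print(plan_questions(["color_palette", "aggregation", "target_columns"]))
# -> ['aggregation', 'target_columns']
```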
The empirical picture is sobering. In native spreadsheet visualization, the strongest reported agents remain around the low 40s in overall score; the text highlights Gemini-3-Pro with a peak score of 40.48 percent and says even GPT-5.2 and DeepSeek-V3.2 do not exceed 38 percent in that setting. In DV-Evolution, top models do better but still struggle with framework shifts; the paper reports Gemini-3-Pro at 51.44 percent overall, with performance varying by visualization library. In DV-Interact, Grok-4 leads with 40.43 percent, but the paper emphasizes that proactive reasoning and clarification remain the bottleneck.
The error analysis is what I would revisit first. For DV-Sheet creation and repair, data accuracy dominates: the paper reports data accuracy as more than half of create errors and 69.31 percent of fix errors on average. For dashboards, the dominant error mode shifts toward visual design, reported at 45.71 percent of the error budget. That tells me current agents can often produce something that looks chart-like, but they still lose the thread between table values, encodings, and the user’s analytic intent.
My judgment is that DV-World is a useful benchmark because it forces data agents to keep several contracts at once: numerical fidelity, visual semantics, software environment, and user intent. The risk is evaluator complexity. Hybrid scoring is probably necessary here, but it gives readers more to audit. I would use DV-World not as a single leaderboard, but as a diagnostic suite for where a data agent breaks.
Connection to tracked themes: data agents, document and spreadsheet intelligence, visual grounding, interactive workflows.
MoRFI: Monotonic Sparse Autoencoder Feature Identification
Authors: Dimitris Dimakopoulos, Shay B. Cohen, Ioannis Konstas
Institutions: University of Edinburgh; Heriot-Watt University
Date: arXiv, April 29, 2026
Links: arXiv, open HTML

This figure shows the validation step that makes MoRFI more than a correlation screen. The authors identify SAE latents that trend monotonically as the fine-tuning mix changes, then steer individual latents and compare them with control latents. The control columns matter: they test whether the gains come from targeted features rather than arbitrary residual-stream perturbation.

The second figure compares a composite known-to-unknown direction with individual top latents across three models. The paper reports that subtracting the composite direction consistently helps, but individual MoRFI-selected latents often help more. That supports a sparse-mechanism reading: the harmful fine-tuning shift is not evenly distributed across the whole direction.
Quick idea: MoRFI studies why supervised fine-tuning on new factual knowledge can increase hallucinations. It uses sparse autoencoders to find latents whose activations change monotonically with controlled fine-tuning conditions, then tests whether steering those latents restores access to pre-trained knowledge.
Why it matters for agents: agent behavior is increasingly created through post-training, tool-use data, synthetic traces, and domain adaptation. If adding new facts or skills damages access to stored knowledge, agent reliability can degrade in ways that final benchmark scores hide. MoRFI gives a mechanistic lens on that failure mode.
The experimental design is deliberately controlled. The authors fine-tune Llama 3.1 8B, Gemma 2 9B, and Mistral 7B v0.3 on closed-book QA data derived from EntityQuestions. They vary two axes: the percentage of facts that were unknown to the pre-trained model and the number of training epochs. They then collect residual-stream activations across checkpoints, pass them through pre-trained SAEs, and build a four-dimensional activation tensor over samples, property settings, features, and time. MoRFI filters SAE features that show robust monotonic trends under bootstrapped statistical testing. A second step applies activation steering to test whether the selected features causally affect accuracy.
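Here is a rough sketch of the monotonic-screening step, assuming the activation tensor at one checkpoint is laid out as (samples, property settings, features) with settings ordered by the share of unknown facts, and using Spearman correlation under a bootstrap as the trend test; the paper's exact statistics, thresholds, and tensor handling may differ.

```python
# Sketch of MoRFI-style monotonic feature screening over SAE activations.
# Tensor layout and test are my assumptions: acts[sample, setting, feature],
# settings ordered by the unknown-fact share in the fine-tuning mixture.
import numpy as np
from scipy.stats import spearmanr

def monotonic_features(acts, n_boot=200, rho_threshold=0.9, frac_required=0.95, seed=0):
    """Keep features whose mean activation trends monotonically across settings,
    robustly under bootstrap resampling of the samples axis."""
    rng = np.random.default_rng(seed)
    n_samples, n_settings, n_features = acts.shape
    order = np.arange(n_settings)
    keep = []
    for f in range(n_features):
        hits = 0
        for _ in range(n_boot):
            idx = rng.integers(0, n_samples, size=n_samples)   # bootstrap resample
            means = acts[idx, :, f].mean(axis=0)                # one value per setting
            rho, _ = spearmanr(order, means)
            if abs(rho) >= rho_threshold:
                hits += 1
        if hits / n_boot >= frac_required:
            keep.append(f)
    return keep

# Toy check: feature 0 rises with the unknown-fact share, feature 1 is noise.
rng = np.random.default_rng(1)
settings = np.linspace(0, 1, 5)
acts = np.stack([
    settings[None, :] * 2.0 + rng.normal(0, 0.05, size=(300, 5)),  # monotonic
    rng.normal(0, 1.0, size=(300, 5)),                              # non-monotonic
], axis=-1)
print(monotonic_features(acts))  # expected: [0]
```

The steering step would then perturb the surviving latents one at a time and measure whether closed-book accuracy on previously known facts recovers, compared against control latents.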
The evidence is specific. The dataset split reported in the appendix contains 81,700 train examples, 10,725 dev examples, and 10,481 test examples. The authors show that increasing the share of unknown facts in the fine-tuning mixture reduces test performance across the three models, especially with longer training. In cross-model steering results, single-latent interventions recover a large fraction of the accuracy lost from fine-tuning on unknown facts; the paper reports recovery values in the 69 to 85 percent range across model and direction settings. For Llama 3.1 8B, the fully unknown fine-tuned baseline is reported at 0.178 accuracy and the known-fact baseline at 0.355; selected latent interventions move the unknown-fact checkpoint upward, with some interventions reaching 0.238.
My caution is scope. This is a closed-book QA setup with particular base models, SAEs, relations, and fine-tuning mixtures. It does not prove that every post-training hallucination has the same sparse direction structure. Still, the paper is useful because it turns a vague claim, “fine-tuning can hurt knowledge,” into a testable mechanism: identify monotonic latent shifts, perturb them, and measure whether knowledge access returns.
I would connect MoRFI back to the other three papers like this: FutureWorld, AgentSim, and DV-World all generate new agent data and feedback. MoRFI reminds us that adding data is not automatically safe. We need monitors for what post-training changes inside the model, especially when agent training loops keep updating behavior.
Connection to tracked themes: large model mechanisms, post-training reliability, sparse autoencoder analysis, agentic training risk.
Reading Priority and Next Questions
I would read FutureWorld first if the goal is agentic training, because the delayed reward setup is a real design shift. AgentSim and DV-World should be read together if the goal is auditable data/document workflows: one focuses on trace construction, the other on professional data-visualization tasks. MoRFI is the mechanism paper I would keep beside any agent-training work, because it asks what post-training does inside the model.
For the next radar, I want to track three questions: can live environments attach confidence to delayed labels; can trace simulators represent messy, contradictory evidence rather than clean retrieval paths; and can post-training pipelines monitor internal knowledge-access damage before it shows up as deployment failures?