Verifiable State for Long-Horizon Agents

16 minute read

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.

What I Am Watching This Round

The repository has no previous Paper Radar entries and the seen list was empty, so I treated this issue as a baseline-setting run rather than chasing the broadest possible news spread. I searched recent arXiv papers from April 27-29, 2026 across agentic training, world models, model mechanisms, document intelligence, and data agents, then cross-checked community and Chinese-media leads. Several strong leads remain on the watchlist, including Recursive Multi-Agent Systems, Programming with Data, DV-World, NVIDIA Nemotron 3 Nano Omni, and recent visual-reasoning reports, but I did not include them here because this issue focuses on the four papers I could read carefully from their open full text.

The shared thread is simple: long-horizon agents fail when their state is implicit. These papers attack that from different sides: task generation with executable verifiers, training-time spatial transition supervision, process reward models that run checks instead of reading traces passively, and memory systems that retrieve exact historical evidence rather than compressed summaries.

Paper Notes

ClawGym: A Scalable Framework for Building Effective Claw Agents

Authors: Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao
Institutions: Gaoling School of Artificial Intelligence, Renmin University of China; IQuest Research; Beihang University
Date: arXiv, April 29, 2026
Links: arXiv, open HTML, project link announced in paper

Overview of the ClawGym-SynData pipeline

Figure 1 is the right starting point because it shows ClawGym as an end-to-end data engine, not just another benchmark. The pipeline starts from persona-driven and skill-grounded task sources, constructs resources and verifiers, filters for task quality, and then packages the result for both training and evaluation. The important caveat is that the figure describes a controlled OpenClaw-style environment; it does not prove that the same synthesis recipe will transfer unchanged to arbitrary desktop or web-agent settings.

Figure pointer: Table 6 compares proprietary models, open-weight frontier models, compact Qwen baselines, and ClawGym-tuned models on ClawGym-Bench and PinchBench. It supports the central empirical claim that agent-specific interaction data can move compact models substantially, but it should be read together with the evaluation protocol because ClawGym-Bench combines code-based checks and rubric-based judgment depending on the task.

Quick idea: ClawGym asks how to build personal agents that operate over files, tools, and persistent workspaces when human-written tasks and verifiers are scarce. The key move is to synthesize many executable Claw-style tasks, attach hybrid verification, collect black-box rollout trajectories, and use those trajectories for supervised and lightweight reinforcement training.

Why this matters to me: agent training often gets discussed as if the missing piece is only better policy optimization. This paper makes the less glamorous but more decisive point that agents need a task factory, realistic workspaces, and verifiers that can judge final state. Without those, policy learning collapses into imitation of brittle transcripts.

The method has four concrete stages. First, ClawGym-SynData combines persona-driven top-down task generation with skill-grounded bottom-up task construction, so the task pool covers both user-like intents and tool-operation primitives. Second, each task is paired with resource preparation and verification design: code-based verifiers where final artifacts can be checked mechanically, rubric-based criteria where semantic judgment is needed, and weighted aggregation when both are present. Third, the authors run strong black-box agents on the OpenClaw harness and filter trajectories by reward to build training data. Fourth, they fine-tune Qwen3-family backbones into ClawGym-Agents and also explore a sandbox-parallel RL loop.
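To make the second stage concrete, here is a minimal sketch of what hybrid verification with weighted aggregation could look like: a code-based check over final workspace artifacts, an optional rubric score from an LLM judge, and a weighted combination when both are present. All names, signatures, and weights are my own illustrative assumptions, not the paper's implementation.

```python
import os
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class HybridVerifier:
    # Hypothetical hybrid verifier: combines a mechanical check over the final
    # workspace with a rubric-based semantic judgment, weighted when both exist.
    code_check: Optional[Callable[[str], float]] = None         # workspace_dir -> 0.0..1.0
    rubric_judge: Optional[Callable[[str, str], float]] = None  # (trajectory, rubric) -> 0.0..1.0
    rubric: str = ""
    code_weight: float = 0.6
    rubric_weight: float = 0.4

    def score(self, workspace_dir: str, trajectory: str) -> float:
        scores, weights = [], []
        if self.code_check is not None:
            scores.append(self.code_check(workspace_dir))
            weights.append(self.code_weight)
        if self.rubric_judge is not None:
            scores.append(self.rubric_judge(trajectory, self.rubric))
            weights.append(self.rubric_weight)
        if not scores:
            raise ValueError("task has no verifier attached")
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def artifact_exists(relative_path: str) -> Callable[[str], float]:
    """Code-based check: did the agent leave the expected file in the workspace?"""
    return lambda workspace_dir: float(os.path.exists(os.path.join(workspace_dir, relative_path)))
```

A verifier of this shape also explains the third stage: rollout trajectories can be kept or dropped by thresholding the aggregated score before they become training data.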

The evidence is useful because it includes data scale, benchmark design, and model outcomes. The paper reports 13.5K executable synthesized tasks, 24.5K interaction trajectories, and a 200-instance human-reviewed ClawGym-Bench across six categories. On the summary comparison, ClawGym-4B, ClawGym-8B, and ClawGym-30A3B reach average ClawGym-Bench scores of 47.73, 50.24, and 56.82, respectively; the corresponding compact Qwen baselines are lower, and ClawGym-30A3B even exceeds the larger Qwen3-235A23B on that benchmark. The authors also report large relative gains on PinchBench and ClawGym-Bench for Qwen3-8B and Qwen3-30A3B after training on synthesized data.

My judgment is positive but bounded. I would keep watching ClawGym because it treats environment state and verification as first-class training infrastructure. The risk is benchmark-local optimization: the stronger the synthetic generator and verifier become, the easier it is for models to specialize to the benchmark’s shape. The next question is whether the same task synthesis and verifier design can survive messier user environments where task success is less cleanly reducible to final files, tables, or checklist items.

Connection to tracked themes: agentic training, data agents, evaluation infrastructure, verifiable tool use.

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

Authors: Wanyue Zhang, Wenxiang Wu, Wang Xu, Jiaxin Luo, Helu Zhi, Yibin Huang, Shuo Ren, Zitao Liu, Jiajun Zhang
Institutions: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Tsinghua University; Harbin Institute of Technology; Wuhan AI Research
Date: arXiv, April 29, 2026
Links: arXiv, open HTML, code, dataset

Overview of World2VLM

Figure 2 explains the whole argument: use a controllable world model to synthesize action-aligned future views, turn each transition into forward and inverse spatial tasks, and post-train a VLM with SFT followed by task-aware GRPO. This supports the claim that world models can be training-time teachers rather than inference-time tools. The caveat is also visible in the diagram: the quality of every downstream label depends on whether the generated transition is geometrically faithful enough.

Figure pointers: Table 2 reports the main benchmark comparison across SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. Table 4 is the most useful ablation because it separates forward-only, inverse-only, and bidirectional supervision, and also compares real-scene-only, simulated-scene-only, and mixed-source training.

Quick idea: World2VLM targets a specific weakness of VLMs: they can describe static images but struggle to imagine how a scene changes under egocentric motion. Instead of calling a world model at test time, the paper distills world-model-generated transitions into the VLM during post-training.

The method is clean. First, the pipeline starts from an anchor observation and a parameterized camera trajectory, then uses teacher world models such as SVC and HY-WorldPlay to synthesize a future view. Second, detector-derived metadata and the source-target pair are converted into eight task families covering motion inference, action verification, object grounding, visibility judgment, and cross-view consistency. Third, Qwen2.5-VL-7B is fine-tuned on about 103K mixed-source records. Fourth, a 1K balanced refinement set is used for GRPO, with task-specific rewards for formatting, numeric accuracy, spatial logic, sequence consistency, and box validity where relevant.
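As a rough illustration of the second step, a single world-model transition can be unrolled into paired forward and inverse records, which is exactly the bidirectional supervision the ablation later compares. This is a sketch under assumed field names and prompt phrasings, not the released pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Transition:
    anchor_image: str         # source view
    future_image: str         # world-model-generated future view
    action: Dict[str, float]  # e.g. {"yaw_deg": 30.0, "forward_m": 0.5}

def forward_record(t: Transition) -> Dict:
    # Forward task: given the anchor view and the camera action, judge the resulting view.
    return {
        "images": [t.anchor_image, t.future_image],
        "question": (
            f"The camera turns {t.action['yaw_deg']:.0f} degrees and moves "
            f"{t.action['forward_m']:.1f} m forward. Is the second image a plausible "
            "result of this motion? Answer yes or no."
        ),
        "answer": "yes",
    }

def inverse_record(t: Transition) -> Dict:
    # Inverse task: given the two views, recover the action that relates them.
    return {
        "images": [t.anchor_image, t.future_image],
        "question": "What camera motion transforms the first view into the second?",
        "answer": f"turn {t.action['yaw_deg']:.0f} degrees, then move {t.action['forward_m']:.1f} m forward",
    }

def build_records(transitions: List[Transition]) -> List[Dict]:
    records: List[Dict] = []
    for t in transitions:
        records.extend([forward_record(t), inverse_record(t)])
    return records
```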

The main evidence is stronger than a single aggregate number because the paper gives both teacher variants and task breakdowns. The base Qwen2.5-VL-7B averages 36.63 across the four benchmarks. World2VLM-GRPO reaches 52.07 with SVC as the teacher and 52.61 with HY-WorldPlay, while the matched MindJourney-style inference-time world-model baseline averages 38.65. On SAT-Real, SVC-GRPO reaches 72.67 compared with 44.67 for the base model. The SAT-Real breakdown is also informative: gains are large on egocentric movement, object movement, goal aim, action consequence, and perspective taking, which makes the improvement look more like motion-conditioned reasoning than generic visual QA transfer.

The ablation is the part I care about most. Forward-only supervision improves average score to 43.44, inverse-only to 41.46, and bidirectional full SFT to 46.75. Mixed-source training also beats real-only and simulated-only variants. This says the model benefits from seeing both directions of the transition: “what action caused this view?” and “what view follows this action?” That is a useful design pattern for world-model distillation beyond this paper.

My concern is teacher reliability. The authors discuss this directly: SVC gives tighter camera-conditioned geometry but can show artifacts, while HY-WorldPlay may be more perceptually smooth but less exact about camera motion. The current pipeline filters noisy samples, but it does not yet estimate teacher confidence in a soft, task-aware way. I would treat World2VLM as a strong proof of the training-time-teacher direction, with the next step being reliability-weighted distillation from multiple world models.

Connection to tracked themes: world models, multimodal reasoning, agentic post-training, embodied spatial understanding.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

Authors: Zhisong Qiu, Shuofei Qiao, Kewei Xu, Yuqi Zhu, Lun Du, Ningyu Zhang, Huajun Chen
Institutions: Zhejiang University; Ant Group
Date: arXiv, April 27, 2026
Links: arXiv, open HTML, code

Overview of DataPRM

Figure 3 shows why DataPRM is not just a classifier over reasoning traces. It builds process-supervision data through diverse trajectory generation and knowledge-augmented step annotation, then uses an environment-aware verifier that can interact with code and apply a reflection-aware reward. The diagram supports the paper’s core claim that data-analysis supervision must inspect execution state, not only text.

Figure pointers: Figure 2 is worth checking because it diagnoses why general PRMs fail: they penalize recoverable grounding errors and miss silent logical errors. Table 2 gives the main Best-of-N results on ScienceAgentBench and DABStep, while Table 3 isolates the contribution of environment interaction, multi-turn checking, and the reflection-aware reward.

Quick idea: DataPRM argues that process reward models trained for math-style reasoning are poorly matched to data-analysis agents. A data agent needs a verifier that can run probes, tolerate exploratory errors, and detect code that executes but computes the wrong result.

The paper begins with a useful distinction. A “grounding error” can be recoverable: the agent guesses a column name, receives an exception, learns the schema, and later solves the task. A “silent error” is more dangerous: the code runs, but the logic is wrong. General PRMs tend to punish the first too harshly and miss the second because they do not interact with the environment.

DataPRM addresses this with three mechanisms. First, it uses a ReAct-style generative verifier that can call tools and inspect intermediate execution state. Second, it expands binary step reward into a ternary scheme: incorrect, neutral or exploratory, and correct. That matters because trial-and-error is not always failure in data work. Third, it constructs over 7K process-supervision instances through diversity-driven trajectory generation and expert annotation, then uses the verifier in both test-time scaling and reinforcement learning settings.
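The ternary idea is easy to picture as a toy labeling rule: an exploratory step that raises an exception but surfaces recoverable information is neutral, while a step that executes cleanly but fails a probe of the resulting state is the dangerous silent error. The `env.run` and `env.probe_checks_pass` hooks below are assumptions for illustration, not the paper's verifier.

```python
from enum import Enum

class StepLabel(Enum):
    INCORRECT = -1   # executed, but probes of the execution state fail
    NEUTRAL = 0      # failed execution that still yields recoverable grounding info
    CORRECT = 1      # executed and consistent with environment probes

def label_step(step_code: str, env) -> StepLabel:
    """env is assumed to expose run(code) -> (ok, result) and probe_checks_pass(result)."""
    ok, result = env.run(step_code)
    if not ok:
        # Grounding error: the exception itself (e.g. the real column names)
        # is evidence the agent can use on the next step.
        return StepLabel.NEUTRAL
    # Silent-error check: the code ran, so verify intermediate state with extra
    # probes (row counts, dtypes, value ranges) instead of trusting the prose.
    return StepLabel.CORRECT if env.probe_checks_pass(result) else StepLabel.INCORRECT
```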

The headline numbers are specific enough to be useful. With Qwen3-235B-A22B-Instruct-2507 as the base policy for Best-of-N selection, the 4B DataPRM reaches a DABStep average of 40.89 at N=16, above majority vote at 38.00 and self-rewarding at 39.77 in the same table. The abstract-level summary reports improvements of 7.21% on ScienceAgentBench and 11.28% on DABStep for downstream policy models. In RL experiments, process-supervised training reaches 78.73% on DABench and 64.84% on TableBench, outperforming outcome-reward baselines in the paper’s setup.
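For readers unfamiliar with the test-time scaling setup, Best-of-N selection with a process verifier reduces to scoring each sampled trajectory step by step and keeping the best aggregate. Mean aggregation here is my assumption; the paper may aggregate step scores differently.

```python
from typing import Callable, Sequence

def best_of_n(
    trajectories: Sequence[Sequence[str]],                 # N trajectories, each a list of steps
    step_scorer: Callable[[Sequence[str], int], float],    # (trajectory, step_index) -> score
) -> Sequence[str]:
    # Pick the trajectory whose steps the process verifier rates highest on average.
    def trajectory_score(traj: Sequence[str]) -> float:
        if not traj:
            return float("-inf")
        return sum(step_scorer(traj, i) for i in range(len(traj))) / len(traj)
    return max(trajectories, key=trajectory_score)
```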

The ablation helps keep the story grounded. A plain CoT verifier reaches a DABStep average of 38.89 at N=16; adding environment and multi-turn interaction helps, and the full DataPRM with environment, multi-turn interaction, and reflection-aware ternary reward reaches 40.89. The gains are not enormous, but the direction is plausible: for data agents, better reward comes from checking the world, not only scoring prose.

My main caution is evaluation dependence. Several datasets require model-as-judge decisions for open-ended answers, and the active verifier costs more tokens and latency than a simple one-pass judge. I still think this paper is worth following because it names the right bottleneck: data-agent reward models must separate useful exploration from real semantic failure.

Connection to tracked themes: data agents, process reward models, agentic training, environment-grounded verification.

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

Authors: Jinze Li, Yang Zhang, Xin Yang, Jiayi Qu, Jinfeng Xu, Shuo Yang, Junhua Ding, Edith Cheuk-Han Ngai
Institutions: The University of Hong Kong; University of North Texas; University of Tsukuba; Yonsei University
Date: arXiv, April 29, 2026; accepted to ACL 2026 Main Conference
Links: arXiv, open HTML

Overview of OCR-Memory

Figure 1 is a compact explanation of the mechanism. The agent history is rendered into multi-resolution images, each segment is marked with visual anchors, and the model retrieves by selecting indices; the original text is then fetched deterministically from logs. This supports the paper’s “faithful evidence recovery” claim, but it also makes clear that the system shifts cost into rendering, storage, and a specialized optical retriever.

Figure pointers: Table 1 reports the main Mind2Web and AppWorld results. Tables 2 and 3 are the key ablations for Set-of-Mark prompting and multi-resolution active recall. Tables 6 and 7 are especially practical: they separate retrieval quality from downstream task success and expose the token-latency-storage tradeoff.

Quick idea: OCR-Memory treats long agent history as something to preserve visually and retrieve exactly, rather than summarizing it into lossy text. The model locates relevant image regions, then the system transcribes by index from stored logs instead of asking the model to regenerate the evidence.

The design has three steps. First, interaction trajectories are rendered into visual memory frames, with Set-of-Mark anchors that give each region an explicit identifier. Second, a learned retriever scores which visual segments are relevant to the current query. Third, the selected index maps back to verbatim text in the external log, which is passed to the downstream agent. A multi-resolution policy keeps older or less relevant history compressed and restores high-resolution views when they become active again.
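The retrieve-then-transcribe-by-index contract is the part worth internalizing, so here is a minimal sketch: the retriever only selects Set-of-Mark anchor IDs over rendered frames, and the verbatim evidence comes from a deterministic lookup in the text log. Class and method names, including `retriever.select`, are assumptions rather than the released code.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OpticalMemoryStore:
    log: Dict[str, str] = field(default_factory=dict)   # anchor_id -> verbatim log text
    frames: List[str] = field(default_factory=list)     # paths to rendered memory images

    def add_segment(self, anchor_id: str, text: str, frame_path: str) -> None:
        # Each history segment is stored twice: rendered for retrieval, raw for transcription.
        self.log[anchor_id] = text
        self.frames.append(frame_path)

    def retrieve(self, query: str, retriever, k: int = 3) -> List[str]:
        # The retriever scores rendered frames against the query and returns anchor IDs;
        # the model never regenerates the underlying text.
        anchor_ids = retriever.select(query, self.frames, top_k=k)
        # Deterministic transcription: map selected indices back to exact log text.
        return [self.log[a] for a in anchor_ids if a in self.log]
```

The design choice this encodes is that hallucination risk is confined to index selection; the evidence handed to the downstream agent is always an exact copy of what happened.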

The evidence is broad. On Mind2Web, OCR-Memory reports 53.8 element accuracy, 59.2 action F1, 46.1 step success rate, and 4.8 task success rate, above retrieval, MemoryBank, AWM, and ACON baselines in Table 1. On AppWorld, it reaches a 58.1 average success rate. The Set-of-Mark ablation is also telling: removing SoM and relying on text generation drops element accuracy to 46.5 and step success to 39.2 while increasing retrieval latency to 5.3 seconds; the full system reports 53.8, 46.1, and 1.7 seconds.

The token-budget analysis is the most product-relevant part. On a Needle-in-a-Haystack-style retrieval test, OCR-Memory keeps recall high while compressing visual tokens relative to raw text by about 10x: 98.5% Recall@1 at 4K and 94.1% at 32K. On a dedicated Mind2Web experience-retrieval subset, Recall@1 improves from 52.7 for Dense Text-RAG to 78.6 for OCR-Memory, and MRR rises from 0.61 to 0.84. The system-efficiency table shows the trade: text tokens injected per step fall from 3,980 to 596, but disk per episode rises from 18 KB to 1.47 MB and retrieval latency from 0.3 to 1.7 seconds.

My read is that OCR-Memory is valuable because it refuses to equate memory with summarization. For compliance, debugging, and long workflows, exact evidence recovery matters. The limitations are also practical: the method needs fine-tuning, visual storage, rendering, and an extra retriever, so it is not a free replacement for text RAG. The question I would ask next is whether optical memory still works when histories contain dense tables, code diffs, screenshots, and multilingual documents in the same episode.

Connection to tracked themes: document intelligence, agent memory, long-context systems, auditable workflows.

Reading Priority and Next Questions

I would read ClawGym and DataPRM first if the goal is agent training infrastructure: both are about turning environment state into supervision. World2VLM is the most interesting for world-model distillation because it changes where imagination lives in the system. OCR-Memory is the most product-facing paper in this set because the cost tradeoff is explicit.

For the next radar, I want to track three questions: whether synthetic task generators start converging on common verifier patterns, whether world-model teachers get reliability estimates rather than hard filtering, and whether data/document agents move from answer accuracy toward audit trails that survive real deployment review.