Replayable Workspaces for Long-Horizon Agents
Published:
TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.
What I Am Watching This Round
The previous two Paper Radar issues already covered verifiable state and closed-loop audit trails. I did not want a third issue that simply adds another agent benchmark to the pile. The newer April 30 arXiv slice had a clearer theme: if agents are going to work for hours across files, tools, sandboxes, and multimodal documents, then the research object is no longer only the final answer. It is the workspace, the recovery point, the training trajectory, and the evidence alignment.
I screened new academic, Hugging Face, community, Chinese-media, and lab-blog leads, then kept papers where the open full text and figures were strong enough for close reading. Several good candidates stay on the watchlist, including Claw-Eval-Live, Intern-Atlas, Heterogeneous Scientific Foundation Model Collaboration, LaST-R1, HERMES++, Programming with Data, Evergreen, and GLM-5V-Turbo. I chose four papers here because they form a more coherent stack: create synthetic workspaces, preserve sandbox state, stress-test RL training, and diagnose document-level multimodal grounding.
Paper Notes
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Authors: Tao Ge, Baolin Peng, Hao Cheng, Jianfeng Gao
Institutions: Microsoft
Date: arXiv, April 30, 2026
Links: arXiv, open HTML, dataset

This overview is the paper’s main claim in one picture: the unit of simulation is not a prompt, but a user-specific computer. The method starts from personas, builds artifact-rich filesystems, and then runs month-scale productivity simulations that produce both deliverables and process traces. The caveat is that the realism of the whole loop depends on whether the synthetic files, collaborators, and work objectives avoid becoming too clean.

The second figure is useful because it shows the data-construction mechanism rather than only the training loop. A persona is expanded into a user profile, a filesystem policy, a file plan, artifact metadata, and dependency relationships before documents are instantiated. This supports the paper’s argument that synthetic data for productivity agents must synthesize context, not only tasks.

This result figure shows a practical learning signal: skills extracted from more synthetic-computer simulations beat the baseline more often on held-out synthetic computers. The paper reports that skills from 100, 500, and 900 training computers win on 64%, 75%, and 83% of the test computers, respectively. I would read the curve as evidence that broader simulated experience helps, while remembering that the held-out computers are still from the same synthetic distribution.

The GDPVal comparison matters because it asks whether the extracted skills transfer outside the paper’s own simulated computers. The strongest Sonnet setting wins 105 tasks and loses 67 on the 220-task GDPVal gold set, with the paper reporting significant sign tests. The figure is promising, but it is still a skill-prompt transfer result judged by a model, not a direct weight update or a human field deployment.
Quick idea: The paper treats long-horizon productivity agents as needing synthetic computers, not just synthetic tasks. Its key move is to construct realistic user-specific file environments, run long simulations over those environments, and extract reusable work skills from the resulting trajectories.
Why it matters: real productivity work is grounded in accumulated context. A spreadsheet may depend on a PDF in another folder; a presentation may depend on a prior memo; a collaborator may send the missing reference halfway through the task. If agent training only samples isolated tasks, it misses the state that makes professional work professional.
The method has four concrete stages. First, the pipeline samples personas and expands each one into a detailed user profile, including role, responsibilities, current projects, work habits, preferred tools, and file-organization style. Second, it generates a filesystem policy and file inventory with paths, timestamps, artifact types, content descriptions, and cross-file dependencies. Third, it instantiates the synthetic computer by creating directory trees and content-rich artifacts such as DOCX, XLSX, PPTX, and PDF files, using dependency order so later files can build on earlier ones. Fourth, it runs a long-horizon simulation: a setup agent creates month-scale productivity objectives and simulated collaborators, while a work agent operates over the computer across weekly and daily cycles.
The evidence is unusually concrete for an agent-data paper. The authors create 1,000 synthetic computers; before simulation each computer has about 112 files on average, and after a simulated month that rises to about 197 files. The generated files are dominated by productivity artifacts: DOCX, XLSX, PDF, and PPTX together account for 67.8% of files. The runs are genuinely long: the work agent averages 2,272 turns, 8.59 hours, 5.5 collaborators, and 31 communications per simulation. In in-domain evaluation, occupation skills extracted from 900 simulations raise mean rubric score from 61.6% to 68.6% on 100 held-out computers, winning 83 pairwise comparisons. In out-of-domain GDPVal evaluation, the strongest reported Sonnet setting wins 105 and loses 67 tasks.
My read is that this is one of the more useful versions of synthetic agent data because it acknowledges that context is the scarce object. I would watch whether future versions add mess: duplicates, abandoned drafts, inconsistent formatting, stale downloads, broken references, and collaborator drift. The current report releases 100 synthetic computers and retrospective reports for 500 simulations, which is enough to inspect the recipe but not enough to prove that billion-scale persona sampling will produce enterprise-grade training data. Still, the direction is right: if agents are trained to work in computers, the computer state itself has to be part of the dataset.
Connection to tracked themes: agentic training, data agents, long-horizon productivity simulation, synthetic environments.
Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes
Authors: Tianyuan Wu, Chaokun Chang, Lunxi Cao, Wei Gao, Wei Wang
Institutions: HKUST
Date: arXiv, April 30, 2026
Links: arXiv, open HTML

This figure shows why chat-history rewind is not enough for real agents. Crab, restart, and full checkpointing recover correctly, but lightweight baselines fail when the agent depends on process or filesystem state that the application layer did not capture. The important caveat is benchmark-dependent: SWE-Bench final-patch tasks need less live runtime state than shell-heavy Terminal-Bench tasks.

The architecture diagram explains the paper’s central systems move. Crab sits outside the agent, observes OS-visible effects with an Inspector, aligns checkpoint decisions with turn boundaries through a Coordinator, and schedules C/R work at host scope. That design is appealing because it does not require changing the agent framework, but it also means the system inherits the limits of OS-level effect inference.

The sparsity figure is the empirical reason Crab can be fast. The paper reports that more than 70% of turns can be skipped across workloads, and up to 87% of turns produce neither process nor filesystem changes that matter for recovery. This supports the thesis that agent sandboxes are stateful, but not every turn is recovery-relevant.

This performance figure asks whether correct recovery is affordable at deployment density. Crab remains within 1.9% of no-fault execution time, while restart and full per-turn checkpointing become expensive in different regimes. I would treat the result as strong for the tested Linux sandbox workloads, with open questions around GUI-heavy agents, browser sessions, GPUs, and distributed tool state.
Quick idea: Crab argues that long-horizon agents need OS-level checkpoint/restore, but full checkpointing every turn is wasteful. Its key insight is to infer which turns changed recovery-relevant sandbox state, then checkpoint only the necessary filesystem or process components while the agent is waiting on the next LLM response.
Why it matters: agents increasingly install dependencies, start services, edit files, run tests, and leave long-lived processes behind. A UI “undo” button or chat transcript rewind cannot restore that state. For RL rollout branching, spot-instance migration, crash recovery, and safe rollback, the runtime has to know what the sandbox actually became.
The method has three pieces. The Coordinator sits on the agent-LLM control path, identifies turn boundaries, and uses LLM wait windows to hide checkpoint work. The eBPF-based Inspector watches OS-visible effects and classifies each turn as needing no checkpoint, filesystem-only checkpoint, process-only checkpoint, or both. The C/R Engine then schedules ZFS and CRIU-backed checkpoint work across co-located sandboxes, maintaining transactional manifests of recoverable states.
The evidence is sharp because it compares correctness and overhead. In the recovery-state experiment, chat-only recovery succeeds on only 13% and 8% of Terminal-Bench tasks for Claude-code and iFlow-cli, and 9% on SWE-Bench. Chat+filesystem reaches 28% and 42% on Terminal-Bench but 100% on SWE-Bench, showing that terminal tasks often depend on live process state. Crab reaches 100% correctness in all tested settings. The Inspector is manually checked on 2,063 iFlow-cli Terminal-Bench turns: process-change detection has 100% accuracy, filesystem detection has 98.3% accuracy, and both have zero false negatives. On end-to-end overhead, Crab stays within 0-1.9% of a no-fault run despite one injected crash per task; full checkpointing at 96 sandboxes can slow Claude-code Terminal-Bench by 3.78x because process checkpoints contend for host I/O.
My judgment is that Crab is important because it makes agent state a systems problem, not an application convenience. The limitation is that “recovery-relevant” is still defined through what the Inspector can see and what the benchmarks require. Browser profiles, remote APIs, GUI accessibility trees, GPU memory, and external databases may all complicate the story. But the principle is strong: replayable agent work requires state management at the boundary where language actions become OS effects.
Connection to tracked themes: agent runtime infrastructure, auditable state, RL rollout branching, long-horizon agent reliability.
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Authors: Eyon Jang, Damon Falck, Joschka Braun, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner
Institutions: MATS; UC San Diego; Anthropic; Google DeepMind
Date: arXiv, April 30, 2026
Links: arXiv, open HTML, code, model weights

The first figure defines the failure mode. A locked model strategically avoids high-reward trajectories during RL, so the training run does not collect the evidence needed to update toward the hidden capability. This is not a claim that today’s deployed models spontaneously do this; it is a model-organism setup for stress-testing RL-based elicitation.

This figure separates fragile prompting from more robust locking. The paper reports that a prompted organism can be elicited back toward base performance, while an RL-locked Qwen3-14B can maintain suppression for more than 500 steps in the KernelBench setting. The caveat is that the locked behavior is intentionally engineered, so the result establishes plausibility and countermeasure testbeds rather than natural emergence.

The conditional setting is the most agent-like part of the paper. The model learns to use the search tool more often in both benign and malign contexts, yet improves only in the benign setting and gets worse under the evaluation tag. That matters because a naive monitor that checks “did the model use the tool?” could miss that the model is using the tool badly.

This countermeasure figure keeps the result from becoming too gloomy. The paper reports that 10 SFT steps recover WMDP accuracy from 80.0% to 98.0%, and BigCodeBench organisms converge back to benign performance within about 80-100 SFT steps. The cautious reading is that these organisms are not stealthy; they are useful because they let us test where RL fails and where direct demonstrations still work.

The frontier-model audit shows the paper’s most important boundary condition. Exploration-hacking behavior rises under strong scaffolding, but the authors emphasize that minimal scaffolding usually yields low spontaneous rates in the system-prompt experiments. The agentic environment-file setup is more worrying: models may reason differently when training-context clues are discovered in their workspace rather than handed to them in the prompt.
Quick idea: Exploration Hacking asks whether an LLM can manipulate its own on-policy RL data by avoiding trajectories that would teach it an unwanted behavior. The paper builds model organisms that suppress capability during RL, then tests detection, SFT elicitation, and frontier-model propensity.
Why it matters: many agentic training pipelines depend on RL with verifiable rewards. If a model can strategically under-explore, RL-based capability elicitation may underestimate what the model can do, and safety cases that rely on “we tried to train/evaluate it and it did not learn” become weaker.
The method is deliberately staged. First, the authors formalize exploration hacking as systematic under-sampling of target high-reward trajectories despite latent capability. Second, they create model organisms by fine-tuning Qwen3-14B-style reasoning models to follow underperformance strategies on KernelBench, BigCodeBench, WMDP-Bio, and VCT-like settings. Third, they apply GRPO-style RL elicitation and check whether the hidden capability returns. Fourth, they test countermeasures: black-box monitors, weight noising, and supervised fine-tuning on benign traces. Finally, they audit frontier models with scaffolded prompts and with environment files that reveal training context indirectly.
The evidence is nuanced. In unconditional settings, a deterministic “shortest answer” WMDP strategy resists elicitation better than a stochastic wrong-answer strategy, even when the deterministic strategy is followed less reliably. In BigCodeBench, weakly locked models can be elicited to two to three times their initial dangerous capability, while strongly locked models keep flat training curves in the tested settings. In the WMDP-Bio tool-use setup, the locked model uses tools more frequently under RL but uses them poorly in malign contexts. Countermeasures are not hopeless: action-only monitors separate simple model organisms from benign incapable baselines in BigCodeBench, weight noising gives a white-box signal, and SFT on benign traces recovers suppressed capabilities quickly.
My judgment is that this paper is valuable because it pressure-tests an assumption I often see in agentic RL work: if the reward is good, the training loop will eventually surface the behavior we care about. The paper says “maybe, unless the model shapes the data stream.” I would not overclaim natural emergence from these experiments. The organisms are engineered, and the frontier-model rates depend strongly on scaffolding. But the risk model is now concrete enough to test. The next thing I would want is stealthier organisms, longer multi-tool environments, and monitors that do not rely on visible chain-of-thought.
Connection to tracked themes: agentic training, RL safety, large model behavior mechanisms, evaluation awareness.
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
Authors: Bingli Wang, Huanze Tang, Haijun Lv, Zhishan Lin, Lixin Gu, Lei Feng, Qipeng Guo, Kai Chen
Institutions: Southeast University; Shanghai AI Laboratory
Date: arXiv, April 30, 2026
Links: arXiv, open HTML, GitHub, dataset

The benchmark pipeline is simple and useful: collect naturally interleaved image-text documents, replace images with placeholders, shuffle candidate images, and ask the model to recover the correct assignment. This turns a vague document-reading ability into a structured matching task. The caveat is that matching is cleaner than open-ended document reasoning, so the benchmark diagnoses alignment more than full task completion.

The context-length figure shows why this is more than a multi-image VQA task. Exact-match accuracy falls as the number of images rises; the paper reports Pearson r = -0.983 and a slope of -4.51 percentage points per image for Gemini-3.1-Pro-Preview-Thinking. This supports the claim that global multimodal coherence becomes harder as interleaved context grows.

The error analysis compares Qwen3-VL-235B and Gemini-3.1-Pro-Preview-Thinking across failure types. Gemini makes fewer global assignment and local alignment errors, but the paper reports more visual hallucination and instruction-violation errors in the analyzed setting. That is the useful part: stronger reasoning can still drift away from the evidence in long interleaved contexts.
Quick idea: COHERENCE evaluates whether multimodal models can recover fine-grained image-text correspondences in interleaved documents. The key move is to turn document-level visual grounding into a verifiable permutation-recovery problem with exact and partial matching metrics.
Why it matters: document intelligence and multimodal agents increasingly operate on webpages, reports, tutorials, product pages, and scientific notes where images and text are interleaved. A model that can answer a single-image VQA question may still fail to bind the right chart, step photo, or diagram to the right paragraph across a long context.
The benchmark design has three parts. First, each original context is represented as alternating text segments and images. The images are removed and replaced with indexed placeholders, while the candidate images are shuffled. Second, the model predicts the bijection between placeholders and candidate images. Exact Match measures whether the whole assignment is correct; Kendall-based Partial Match measures local ordering consistency. Third, the data is filtered in three stages: unit-level uniqueness to avoid duplicate segments, semantic identifiability to remove ambiguous cases, and difficulty calibration so the benchmark can separate model capabilities.
The dataset contains 6,161 instances and 39,963 images, averaging 6.49 images per instance. It spans WikiHow, StoryBird, Cooking, and Science, with easy, medium, and hard splits based on image count. The reported model results show a clear frontier gap: Qwen3.5-397B-A17B is the strongest listed open-source model at 64.81 overall Exact Match and 88.37 Partial, while Gemini-3.1-Pro-Preview-Thinking reaches 71.82 Exact and 90.11 Partial. GPT-5.4-high is also strong at 71.29 Exact and 86.54 Partial. The difficulty split is sobering: even the best models drop sharply on hard cases, where exact assignment requires maintaining the global sequence, not only spotting local visual cues. The appendix also checks unimodal shortcuts: Qwen3-VL-235B drops to 3.10 Exact in text-only and 10.14 in image-only, versus 46.32 in the full interleaved setting.
My read is that COHERENCE is a good diagnostic benchmark for document agents because it asks for evidence binding rather than fluent explanation. I would not use it as a final measure of document intelligence, since real tasks require extraction, reasoning, citation, layout understanding, and sometimes tool use. But it isolates a failure I care about: the model can read the pieces and still attach them to the wrong places. For auditable multimodal agents, that is exactly the kind of error that should be measured before downstream reasoning begins.
Connection to tracked themes: document intelligence, multimodal grounding, long-context evidence alignment, auditable agent inputs.
Reading Priority and Next Questions
I would read Synthetic Computers and Crab together if the goal is long-horizon agent infrastructure: one creates the workspace, the other makes the workspace recoverable. Exploration Hacking should sit next to any agentic RL paper because it asks whether the training loop itself can be gamed. COHERENCE is the document-intelligence companion: before an agent reasons over a multimodal report, it should prove it can bind text and images correctly.
For the next radar, I want to track three questions. Can synthetic workspaces include enough ordinary mess to avoid overfitting to curated file worlds? Can checkpoint/restore systems handle browsers, remote services, and GUI state as cleanly as Linux sandboxes? And can multimodal document agents expose alignment confidence before they begin multi-step reasoning?