Operating Surfaces for Agents That Need to Stay Correct
Published:
TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding everything already covered on the recently seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.
What I Am Watching This Round
The last several Paper Radar issues circled around verifiable state, hard evidence, and scientific-agent interfaces. I do not want this column to become a sequence of agent benchmark summaries. The stronger thread this time is narrower: when an agent has to operate for more than a few steps, the interface it uses becomes part of the model.
For a data agent, the interface can be a shared table-building workboard. For a GUI agent, it can be an event detector that decides when a cheap policy should call a stronger model. For a multi-agent workflow, it can be an execution trace that exposes where contaminated information changed routing and tool use. For RLHF, it can be the reward head vector, which gives interpretability tools a scalar target that is not the vocabulary unembedding.
That is also my writing constraint for this issue. Each selected paper has to introduce the problem before the method, and every figure included below has to carry a real explanatory load.
Paper Notes
Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
Authors: Yuxuan Huang, Yihang Chen, Zhiyuan He, Yuxiang Chen, Ka Yiu Lee, Huichi Zhou, Weilin Luo, Meng Fang, Jun Wang.
Institutions: University of Liverpool; Huawei Noah’s Ark Lab; University College London.
Date/Venue: April 29, 2026, arXiv preprint.
Links: arXiv | HTML | code

This interface screenshot is a good entry point because it shows the paper’s main claim in operational form. The agent is not only searching the web; it is maintaining a visible table-building workspace with task decomposition, shared constraints, and worker result slots. The useful claim is that broad extraction needs coordination over partial structured state. The caution is that this interface assumes the task can be decomposed into relatively independent subproblems.

The architecture figure separates long-term skills from short-term workboard state. The orchestrator reads decomposition skills, workers read execution skills, and the shared workboard mediates coordination with tag partitioning and file locks. The red training paths matter: the system updates external skill memories through run-verify-reflect episodes, while the underlying LLMs stay frozen. That makes the paper more about agent memory and coordination than about training a new model.
Quick idea: Web2BigTable turns broad web search into a structured web-to-table task, then uses a bi-level agent system where an orchestrator learns decomposition skills and workers learn reusable execution skills through external memory.
Why it matters: many data-agent demos are deep-search demos in disguise. They find one answer, explain it, and maybe cite a page. A business, research, or product-monitoring task often asks for a table: many entities, the same columns for each entity, conflicting sources, and enough coverage that missing rows can change the conclusion. A monolithic agent tends to lose coverage because it has to carry all entities, constraints, source notes, and partial rows in one trajectory.
Method walkthrough:
- The paper defines web-to-table search as producing a schema-aligned table from open web evidence, with both breadth-oriented and depth-oriented settings.
- Web2BigTable splits the policy into an upper-level orchestrator and lower-level workers. The orchestrator decomposes a query into subtasks, while workers search, verify, and write structured outputs into a shared Markdown workboard.
- During training, completed episodes are checked against gold tables. Reflection updates two external skill banks: orchestrator skills for decomposition and worker skills for execution. No gradient update is applied to the underlying LLMs.
- During inference, the learned skill banks are frozen. The system runs workers in parallel and aggregates their table rows from the shared workboard (a toy sketch of this loop follows below).
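Here is a toy Python sketch of that inference-time structure, under loud assumptions: the `Workboard` class, the sharding in `decompose`, and the stub worker are all mine, not the paper's code. The paper's workboard is a shared Markdown file coordinated with tag partitioning and file locks, which I approximate with an in-process lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Subtask:
    tag: str        # workboard partition this worker owns
    entities: list  # entities this worker is responsible for

class Workboard:
    """Toy stand-in for the paper's shared Markdown workboard:
    a tag-partitioned row store guarded by a lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self.rows = {}  # tag -> list of structured rows

    def write(self, tag, row):
        with self._lock:  # the paper uses file locks; a thread lock here
            self.rows.setdefault(tag, []).append(row)

def decompose(query, orchestrator_skills):
    """Stub orchestrator. In the paper this is an LLM conditioned on a
    learned decomposition-skill bank; here we just shard a fixed list."""
    entities = [f"{query}-entity-{i}" for i in range(6)]
    return [Subtask(tag=f"shard-{i}", entities=entities[i::2]) for i in range(2)]

def run_worker(subtask, worker_skills, board):
    """Stub worker. In the paper this searches the web, verifies sources,
    and writes schema-aligned rows; here we emit placeholder rows."""
    for entity in subtask.entities:
        board.write(subtask.tag, {"entity": entity, "value": "stub"})

def answer(query, orchestrator_skills, worker_skills):
    board = Workboard()
    subtasks = decompose(query, orchestrator_skills)  # skill banks frozen at inference
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda s: run_worker(s, worker_skills, board), subtasks))
    return [row for rows in board.rows.values() for row in rows]

print(answer("concert-table", orchestrator_skills=[], worker_skills=[]))
```

The design point the sketch preserves is that coordination state lives in the workboard, not in any one agent's context, which is why a worker crash or retry does not lose the partial table.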
Evidence: on WideSearch, the paper reports Avg@4 Success Rate 38.50, Row F1 63.53, and Item F1 80.12. The authors emphasize that this is not simply a stronger-backbone effect: GPT-5 mini and Gemini 3 Flash as single agents reach only 33.28 and 31.61 Item F1, while the full framework reaches 80.12. On XBench-DeepSearch, the Gemini 3 Flash configuration reaches 73.0 accuracy. The case study is also useful: for a Taylor Swift concert table with 534 ground-truth rows, Web2BigTable retrieves 556 deduplicated rows and reaches 93.8% Row F1, while single-agent and skill-less variants retrieve far fewer rows.
My judgment: I like this paper because it treats “memory” as editable procedure, not as a bag of embeddings. The strongest part is the decomposition/workboard/skill-bank loop. The weaker part is that the decomposition assumption is doing a lot of work. If a table row depends on joint reasoning across entities, file locks and worker slots may not be enough; the next version should make cross-worker contradiction resolution more explicit.
Connection to tracked themes: data agents, agent memory, web extraction, structured workflow state.
Step-level Optimization for Efficient Computer-use Agents
Authors: Jinbiao Wei, Kangqi Ni, Yilun Zhao, Guo Gan, Arman Cohan.
Institutions: Yale NLP Lab; University of North Carolina at Chapel Hill.
Date/Venue: April 29, 2026, arXiv preprint.
Links: arXiv | HTML

The pipeline figure explains the paper without needing a leaderboard first. A small policy acts by default, while lightweight monitors watch for two event types: stuck behavior and milestone moments. When risk rises, the controller escalates to a stronger model and then returns control to the cheaper policy. This supports the paper’s practical claim that computer-use agents should allocate compute at the step level rather than invoke a frontier model everywhere.

This table is central because it puts accuracy, latency, cost, and escalation share in the same view. On OSWorld, EvoCUA-8B plus Kimi K2.5 reaches 58.2% success at $0.051 per task, close to standalone Kimi K2.5 at 60.1% and far cheaper than always using Claude Sonnet 4.5. The table also shows the hidden price of cascading: large models still handle around 40-63% of executed steps in the reported OSWorld cascades. This is not “small model replaces big model”; it is “small model avoids some unnecessary big-model calls.”
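To see why that escalation share matters, here is the blended-cost arithmetic as a short Python sketch. The per-step prices and step count below are invented for illustration; only the escalation-share range comes from the paper.

```python
# Hypothetical per-step prices; only the structure of the formula
# reflects the paper's setup, not these numbers.
small_step, big_step = 0.0002, 0.004   # assumed $ per executed step
steps_per_task = 40                    # assumed steps per task
for big_share in (0.40, 0.63):         # paper's reported escalation range
    cost = steps_per_task * (big_share * big_step + (1 - big_share) * small_step)
    print(f"big-model share {big_share:.0%}: ${cost:.3f} per task")
```

Even at the low end of the range, the big model dominates the blended cost, which is why the savings are real but bounded.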

The ablation figure is the paper’s best argument that the routing policy is not a single heuristic. Stuck detection helps with local loops, while milestone detection catches semantic checkpoints where a trajectory can drift while still looking locally plausible. Using both detectors gives the best success on OSWorld and WebArena in the shown settings. The caveat is that detector labels and event definitions become part of the system contract; if those drift, the cascade can over-escalate or miss important failures.
Quick idea: this paper replaces always-on expensive computer-use inference with an event-driven cascade: cheap default actions, learned monitors, and stronger-model calls only when the trajectory looks risky.
Why it matters: GUI and computer-use agents are expensive partly because each step is treated as equally hard. In reality, many actions are routine, but a few moments decide whether the whole task will loop, drift, or recover. Uniform compute allocation is wasteful, and it also hides an evaluation problem: the final answer may fail because the model made one bad early action, not because it needed frontier reasoning at every step.
Method walkthrough:
- The framework runs a smaller policy as the default agent over OSWorld and WebArena-style interaction trajectories.
- A stuck monitor inspects recent reasoning/action history for repeated actions, lack of progress, or degraded local behavior.
- A milestone monitor identifies semantically meaningful checkpoints where sparse verification by a stronger model is likely to catch silent drift.
- The controller escalates to the stronger model only on detected events, then returns control to the small policy after recovery or verification (sketched below).
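A minimal Python sketch of the controller loop follows. Every interface here (the task and policy callables, the monitor signature, the stuck heuristic) is my illustration of the mechanism, not the paper's code; the paper uses learned detectors, and a milestone monitor would be a second callable in `monitors`.

```python
def run_cascade(task, small_policy, big_policy, monitors, max_steps=50):
    """Step-level cascade: the small policy acts by default, and any
    firing monitor hands the next step to the stronger model."""
    state, history = task.reset(), []
    escalations = 0
    for _ in range(max_steps):
        if any(monitor(history, state) for monitor in monitors):
            action = big_policy(state, history)    # escalate on a detected event
            escalations += 1
        else:
            action = small_policy(state, history)  # cheap default path
        state, done = task.step(action)
        history.append(action)
        if done:
            break
    return state, escalations

def stuck_monitor(history, state, window=3):
    """Toy stuck heuristic: the same action repeated `window` times.
    The paper trains a detector over recent reasoning/action history."""
    return len(history) >= window and len(set(history[-window:])) == 1
```

The point the sketch makes explicit is that escalation is per step, not per task: control returns to the small policy immediately after the expensive call.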
Evidence: on OSWorld, Qwen3-VL-8B plus Kimi K2.5 reaches 59.3% success versus standalone Kimi K2.5 at 60.1%, with lower reported cost. EvoCUA-8B plus Kimi K2.5 reaches 58.2% success at $0.051 per task. On WebArena, AgentTrek-32B plus GPT-5.2 reaches 58.8% versus standalone GPT-5.2 at 60.1%, reducing cost from $0.335 to $0.208 per task. The detector study reports that the learned stuck detector reaches 93.9% accuracy and 91.5% F1; the milestone detector has 94.1% accuracy but lower 62.0% F1, which fits the intuition that milestones are sparse and ambiguous.
My judgment: this is a product-relevant agent paper. It does not pretend that a small model can do everything. It asks a better question: when does the agent need expensive judgment? I would watch whether the same detectors survive outside OSWorld/WebArena, especially on enterprise UIs where “stuck” can look like waiting for a slow service and milestones may be domain-specific.
Connection to tracked themes: agentic training, GUI agents, compute allocation, intermediate-state verification.
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Authors: Anna Mazhar, Huzaifa Suri, Sainyam Galhotra.
Institutions: Cornell University; University of Illinois Urbana-Champaign.
Date/Venue: April 30, 2026, arXiv preprint; ACM CAIS 2026.
Links: arXiv | HTML

This figure is the right way to introduce the paper. A table parsing error in a revenue-analysis workflow does not just change one value; it changes downstream routing, retries, and interpretation. The figure supports the paper’s core claim that uncertainty in multi-agent workflows is not only an input-quality issue. It becomes a trajectory issue once agents route work, call tools, and write intermediate state based on corrupted artifacts.

The structural edit-distance plot shows that perturbations differ in how much they reshape workflow traces. OCR noise produces consistent structural change, while image blur has higher variance. This matters because two perturbations with similar final-answer impact can have very different operational signatures. The caveat is that edit distance abstracts away lexical content, so it is useful for control-flow diagnosis, not for semantic correctness by itself.

The first-divergence plot adds timing. Some perturbations cause immediate routing or parsing changes, while others pass through early stages and surface later during reasoning or synthesis. This supports a practical design lesson: verification should not be a single final guardrail. Different contamination modes need checks near extraction, computation, or synthesis depending on where divergence first appears.
Quick idea: the paper injects controlled perturbations into artifact-derived representations, then compares clean and perturbed multi-agent traces to localize when information contamination changes behavior.
Why it matters: multi-agent systems increasingly process PDFs, spreadsheets, slides, images, and tool outputs. A bad OCR span, parsed table, or derived number may cross several agent boundaries before it affects an answer. Final-answer accuracy cannot tell whether the system ignored the perturbation, silently absorbed it, rerouted and recovered, or spent a lot of extra tokens before failing.
Method walkthrough:
- The authors instantiate a fixed multi-agent workflow over GAIA tasks with file attachments, including extraction, analysis, code-generation, validation, and coordination roles.
- They perturb artifact-derived representations rather than raw files, matching the place where agents consume parsed text, tables, images, or audio outputs.
- Each run logs routing decisions, tool calls, memory operations, agent outputs, and final outcomes as structured execution events.
- Clean and perturbed runs are compared through structural signatures and normalized edit distance, with a first-divergence point used to localize where the cascade begins (both metrics are sketched below).
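For concreteness, here is a minimal Python sketch of those two comparison metrics, assuming traces are lists of hashable structural events such as (agent, action) pairs; that event schema is my simplification of the paper's logged routing, tool, and memory events.

```python
def normalized_edit_distance(clean_trace, perturbed_trace):
    """Levenshtein distance over event sequences, normalized by the longer
    trace. Lexical content is deliberately abstracted away: this measures
    control-flow change, not semantic correctness."""
    a, b = clean_trace, perturbed_trace
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete
                            curr[j - 1] + 1,          # insert
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)

def first_divergence(clean_trace, perturbed_trace):
    """Index of the first event where the paired traces differ,
    or None if they are identical."""
    for i, (x, y) in enumerate(zip(clean_trace, perturbed_trace)):
        if x != y:
            return i
    if len(clean_trace) != len(perturbed_trace):
        return min(len(clean_trace), len(perturbed_trace))
    return None

clean = [("router", "extract"), ("extractor", "parse_table"), ("analyst", "compute")]
dirty = [("router", "extract"), ("extractor", "parse_table"), ("router", "retry"),
         ("extractor", "parse_table"), ("analyst", "compute")]
print(normalized_edit_distance(clean, dirty), first_divergence(clean, dirty))
```

In the toy example the perturbed run reroutes and retries extraction: distance 0.4 with divergence at step 2, exactly the kind of detour-with-recovery signature the paper classifies.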
Evidence: the study covers 614 paired clean/perturbed runs across 32 GAIA tasks and three LLM backends. The key empirical result is a decoupling between trace behavior and final outcome. The paper reports 15.3% silent semantic corruption, where traces remain structurally similar while the outcome changes; 40.3% behavioral detours with recovery, where execution changes but the final outcome matches; and 39.9% combined structural and outcome disruption. Among divergent runs, rerouting appears in 80.6%, extended execution in 37.4%, and early termination in 25.3%. Cost is not a reliable proxy either: only 16.3% of high-cost runs above 2x overhead succeed, while 76.2% of low-cost runs below 1.2x overhead are incorrect.
My judgment: this is one of the more useful agent evaluation papers this week because it makes failure legible without pretending every divergence is bad. A detour can be healthy recovery; a stable trace can still be wrong. I would like the next step to connect these trace signatures to automated mitigations, such as targeted re-extraction, per-modality validators, and budget policies that escalate based on contamination class rather than generic uncertainty.
Connection to tracked themes: multi-agent reliability, document intelligence, trace-native evaluation, auditable workflows.
reward-lens: A Mechanistic Interpretability Library for Reward Models
Authors: Mohammed Suhail B. Nadaf.
Institutions: not specified.
Date/Venue: April 28, 2026, arXiv preprint.
Links: arXiv | HTML

The dashboard figure gives the paper’s basic object: instead of projecting intermediate states onto vocabulary logits, it projects them onto the reward head direction. That makes preference formation visible across layers for a reward model. The claim is not that the lens proves causality. It gives an observational view that can be compared against interventions.

This figure is the paper’s most important warning. Late MLP components dominate linear attribution, but early-layer components dominate activation patching in the shown Skywork helpfulness example. The visual supports the negative finding: where the reward score is linearly attributed is not necessarily where the model is causally load-bearing. For alignment work, that distinction is not cosmetic.

The trajectory overlay compares preference formation across Skywork and ArmoRM. The curves suggest that different reward architectures can form preferences along broadly similar depth trajectories while using different component sets. This is a useful bridge between mechanism and training practice, but the paper is careful that these are early empirical slices rather than a universal law of reward models.
Quick idea: reward-lens ports mechanistic interpretability tools to reward models by replacing the vocabulary unembedding target with the reward head vector $w_r$.
Why it matters: RLHF systems rely on reward models, but most interpretability tools were built for generative LLMs. A logit lens asks how an intermediate state projects onto tokens through $W_U$. A reward model does not end in $W_U$; it ends in a scalar head. If we inspect only the policy model and not the reward model, we miss the component that shaped optimization pressure in the first place.
Method walkthrough:
- The paper centers the analysis on the reward head equation $r(x,y) = w_r^\top h_{\mathrm{final}}(x,y) + b_r$. The vector $w_r$ becomes the axis for lensing, attribution, SAE-feature alignment, and concept analysis.
- Reward Lens projects intermediate residual streams onto $w_r$, producing a layer-wise preference trajectory (a minimal sketch of this projection follows after the list).
- Component attribution decomposes attention and MLP contributions along the same axis.
- Activation patching swaps components between preferred and dispreferred completions, then measures the causal change in reward rather than relying on observational attribution.
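Here is that lens projection as a minimal PyTorch sketch. Everything except the math is an assumption: the function name, shapes, and the random tensors standing in for hook-extracted residual streams are mine; real extraction hooks are model-specific and not part of the sketch.

```python
import torch

def reward_lens(residual_streams, w_r, b_r=0.0):
    """Project per-layer residual states onto the reward head direction:
    r_l = w_r^T h_l + b_r, giving a layer-wise preference trajectory.
    residual_streams: [num_layers, hidden]; w_r: [hidden]."""
    return residual_streams @ w_r + b_r

# Toy usage with random tensors standing in for the residual stream at the
# scored position after each layer, for a preferred/dispreferred pair.
torch.manual_seed(0)
layers, hidden = 32, 4096
w_r = torch.randn(hidden)
h_preferred = torch.randn(layers, hidden)
h_dispreferred = torch.randn(layers, hidden)
margin = reward_lens(h_preferred, w_r) - reward_lens(h_dispreferred, w_r)
# The fractional depth at which this margin stabilizes is the paper's
# "crystallization" point; with random data it is, of course, just noise.
print(margin)
```

The one-line core is the whole trick: swap $W_U$ for $w_r$ and the existing lens machinery transfers to a scalar-headed model.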
Evidence: the paper evaluates Skywork-Reward-Llama-3.1-8B-v0.2 and ArmoRM-Llama3-8B-v0.1 over about 695 RewardBench preference pairs per model. Skywork preferences crystallize uniformly late, around fractional depth 0.90 across helpfulness, safety, correctness, and verbosity. ArmoRM crystallizes earlier and with higher variance, for example correctness at 0.64 ± 0.34. The central result is negative but valuable: linear attribution does not predict causal patching effects. Mean Spearman correlation is -0.256 on Skywork and -0.027 on ArmoRM. In the Skywork helpfulness example, the top attribution components are late MLPs, while the strongest patching effects come from early MLP and attention components.
My judgment: I included this one because it answers a standing reader request for more depth and formulas. The formula is simple, but it changes the inspection target. The paper's strongest contribution is the insistence that attribution and patching answer different questions. The weakest part is the sample size around some hacking and cascade probes; the author is explicit that those results are directional. I would treat reward-lens as a tool paper whose value depends on follow-up replication across more reward models and larger probe sets.
Connection to tracked themes: large model mechanisms, agentic training, reward-model auditing, interpretability tooling.
Reading Priority And Next Questions
My order after this issue: Web2BigTable first if you care about data agents, Trace-Level Contamination first if you care about deployment review, Step-level Optimization first if you care about GUI-agent cost, and reward-lens first if you care about the training signal that shapes agent behavior.
Next questions I would track:
- Can shared workboards support cross-worker contradiction resolution, not only parallel row filling?
- Can GUI-agent cascades learn event detectors that transfer across products and slow real-world services?
- Can trace divergence trigger targeted validation policies before final-answer checking?
- Can reward-model interpretability be wired into RLHF training loops as a live alarm instead of a post-hoc notebook?