Posts by Tags

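From Compiled Agents to Selective Memory and Deep-Research Derivation Debt

3 minute read

Published: May 21, 2026

TL;DR: this round is not about “what else the model has learned” but about how the runtime around an agent becomes more controllable. Agent JIT compiles web-agent tasks into executable plans with state invariants and latency-aware scheduling. Mem-pi turns memory into a generative policy that chooses whether to speak at all. DeepWeb-Bench reminds us that the main errors of deep-research agents often lie not in “failing to find sources” but in derivation, calibration, and cross-source reconciliation.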

Compiled Agents, Adaptive Memory, and Derivation Debt

16 minute read

Published: May 21, 2026

TL;DR: this round is about the runtime around an agent, not only the agent model. Agent JIT treats web-agent plans as compilable code with state invariants and latency-aware scheduling. Mem-pi turns memory into a small generative policy that can decide not to speak. DeepWeb-Bench shows that deep-research agents often find sources but still fail at derivation, calibration, and cross-source reconciliation.

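From Executable Software Worlds to Physically Structured World Models

5 minute read

Published: May 20, 2026

TL;DR: this round asks how reliable the “world” in agent training and evaluation really is. EnvFactory automatically synthesizes executable, verifiable tool environments for training tool-use agents; OpenComputer turns desktop software tasks into verifier-grounded software worlds that can read real application state; PH-Dreamer puts Port-Hamiltonian physical structure into a visual world model, so that latent imagination not only chases reward but also stays closer to energy and motion constraints.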

Executable Worlds for Agents, Physical Priors for World Models

15 minute read

Published: May 20, 2026

TL;DR: this round is about making the world around an agent less imaginary. EnvFactory synthesizes executable tool environments for agentic RL instead of training only on over-specified traces. OpenComputer builds verifier-grounded desktop software worlds where computer-use agents can be scored by application state, not just screenshots or LLM judgment. PH-Dreamer puts a Port-Hamiltonian structure inside a visual world model, so latent imagination is pressured to respect energy and smoothness instead of only predicting reward.

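Explore Before Acting, Assemble Evidence Before Answering

4 minute read

Published: May 18, 2026

TL;DR: this round I focus on what agents and world models should do before answering or acting. One paper makes “exploring an unfamiliar environment” a trainable capability instead of hoping task reward teaches it as a side effect; one turns deep research from voting over multiple search trajectories into assembling a shared evidence graph; one asks whether video latent prediction has actually learned more world-model-like representations, rather than looking only at clean classification accuracy.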

Exploration Before Action, Evidence Before Answers

15 minute read

Published: May 18, 2026

TL;DR: this round is about agents and world models that should not rush straight to an answer. One paper trains LLM agents to explore an unfamiliar environment before acting. One turns deep research into a shared evidence graph rather than a pile of independent search traces. One asks whether latent video prediction produces world-model representations that survive corruption, occlusion, contact ambiguity, and reversed time.

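From Path Choices to Routing Geometry

4 minute read

Published: May 13, 2026

TL;DR: this round I focus not on “whether agents have intermediate state” but on how these intermediate structures change the decisions themselves. ToolCUA studies when a computer-use agent should keep clicking through the GUI and when it should switch to a tool; DataMaster turns data engineering into a search with branching, memory, and feedback; PriorZero tries to plug language-model priors into world-model planning, but only at the MCTS root node; the MoE routing paper explains mechanistically why routers and experts learn a shared geometry, and how common load-balancing losses may damage it.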

Path Choices, Data Search, and Routing Geometry

15 minute read

Published: May 13, 2026

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

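Reading Execution Traces Before Agents Fail

6 minute read

Published: May 11, 2026

TL;DR: this round looks at how agent traces become objects of training and diagnosis. A3 uses the structure of shell commands to do step-level credit assignment for CLI agents, rather than looking only at whether the whole trajectory succeeded. The tool-calling paper finds that a model's internal tool choice can be linearly read out and intervened on, and that wrong tool calls show a signal before the JSON is emitted. MASPrism uses NLL and attention from a small model's prefill stage to locate likely failure sources in long multi-agent logs, without generating a long diagnostic write-up.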

Reading Agent Traces Before They Become Failures

17 minute read

Published: May 11, 2026

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

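Reading Hidden State Before Acting

5 minute read

Published: May 10, 2026

Short TL;DR: this round is about what a model or agent actually “sees” before it acts. Natural Language Autoencoders translate residual-stream activations into natural language, giving model auditing a readable entry point. BAMI diagnoses coordinate bias in GUI grounding and corrects some mis-clicks at test time. Sheet as Token compresses multi-sheet workbooks into retrievable sheet-level objects, so data agents do not have to swallow an entire Excel file up front.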

Reading Hidden State Before Agents Act

17 minute read

Published: May 10, 2026

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

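Teaching Long-Horizon Agents to Plan, Delegate, and Check Citations

4 minute read

Published: May 09, 2026

TL;DR: this round looks at how long-horizon agents escape a single reactive loop. StraTA has the agent generate a global strategy before execution and trains all subsequent actions under that strategy; RAO turns recursive subagents into a trainable inference-time scaling mechanism; the citation-attribution evaluation paper checks whether the citations in deep-research reports actually support the sentence they sit next to.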

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published: May 09, 2026

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

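Skills, Retrieval, and Memory-Augmented World Models

4 minute read

Published: May 08, 2026

TL;DR: this round looks at “actionable state” in agent workflows. SkillOS lets an agent learn from experience to maintain a skill library; SIRA compresses multi-turn search into a single corpus-aware lexical retrieval action; HaM-World uses selective memory and geometric structure to stabilize the world-model latents used for planning.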

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published: May 08, 2026

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

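Making Intermediate Artifacts Truly Usable by Agents

4 minute read

Published: May 06, 2026

TL;DR: this round asks whether “intermediate artifacts” can really be consumed by downstream systems. TraceLift discusses whether reasoning traces should be rewarded by how much they help a frozen executor; BRIGHT-Pro discusses whether a retriever should cover a set of complementary evidence rather than find a single relevant passage; Agentic-imodels redefines interpretable models as tools that “an agent can read and simulate”.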

Intermediate Work That Agents Can Actually Use

13 minute read

Published: May 06, 2026

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

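From Search Trajectories to Memory Circuits to Theories of the World

5 minute read

Published: May 06, 2026

TL;DR: this round I want to look at the structure an agent has already learned before it gives an answer. OpenSeeker-v2 discusses how far high-quality, long, difficult search trajectories can push a pure-SFT search agent; the agent memory circuit paper separates the three memory stages of write, manage, and read for circuit tracing; Learning-to-Theorize pushes world models from “predicting the next frame” toward “inducing executable theories from observations”.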

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published: May 06, 2026

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

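See the Evidence Clearly, Then Let the Model Answer

6 minute read

Published: May 05, 2026

TL;DR: this round does not fall back to April 30 but stays with the new papers from May 3-4, on the theme “what should a model look at before it answers or generates”. FlexSQL lets a data agent repeatedly check schemas, values, execution results, and plans during reasoning, rather than fixing schema retrieval once up front. Chart-FR1 trains dense-chart reasoning as an explicit visual-focusing process, binding reasoning steps to OCR text and local regions. PV-VAE changes the video VAE from pure reconstruction to predictive reconstruction, forcing latents to carry information about motion and future change.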

Agents That Look Before They Answer

18 minute read

Published: May 05, 2026

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

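Moving Evidence Out of the Prompt

6 minute read

Published: May 03, 2026

TL;DR: there were few new arXiv submissions on the target topics from May 1 to 3, so after deduplication this round expands to the latest April 30 window. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation with dynamic demand signals and scores observable actions; ObjectGraph turns documents into traversable typed graphs rather than injecting them wholesale into context; CIRM intervenes on reward-model activations at inference time, reducing the risk that formatting shortcuts become training labels.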

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published: May 03, 2026

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

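Adding Checks Before Drift Becomes the Answer

5 minute read

Published: May 03, 2026

TL;DR: this round I focus on whether a system can detect or correct drift before it “looks finished”. There were not enough new arXiv CS submissions in the target areas from May 1 to 3, so I expanded to the latest April 30 window and picked four open-access papers: PRISM discusses pre-alignment of multimodal models before RLVR, PhyCo discusses control of physical attributes in video generation, FCMBench-Video discusses evaluation when document evidence unfolds over time, and Latent Adversarial Detection discusses the signal of multi-turn attack intent in activation trajectories.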

Checks Before Agents Drift

20 minute read

Published: May 03, 2026

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

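Giving Agents Structured Surfaces They Can Operate On

5 minute read

Published: May 02, 2026

TL;DR: this round I focus not on “giving agents even more context” or “swapping in a stronger model at every step” but on what kind of structured operating surface agents need while working. The four selected papers discuss, respectively: large-scale web-to-table data extraction with a shared workboard, event detection to decide when a computer-use agent should escalate to a large model, trace-based analysis of information contamination in multi-agent systems, and moving the target of reward-model mechanistic analysis from vocabulary projection to the reward-head direction.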

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published: May 02, 2026

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

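Better Structured Interfaces for Scientific Research Agents

4 minute read

Published: May 02, 2026

TL;DR: this round I look not at “agents getting better at solving problems” but at what interfaces scientific research agents actually need. The four papers give more explicit structure to, respectively, method-evolution graphs, calls to domain foundation models, complex literature discovery, and scientific-visualization workflows. My take: what research agents really lack is often not longer context but an intermediate layer that can be queried, called, verified, and replayed.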

Structured Interfaces for Scientific Agents

18 minute read

Published: May 02, 2026

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

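Confronting Data and Interface Agents with Hard Evidence

3 minute read

Published: May 01, 2026

TL;DR: this round I look at “evidence objects getting harder”. None of the four papers is satisfied with having an agent write a decent-looking answer; they go on to ask whether scientific data tasks can be checked by an executable verifier, whether table question answering can recognize implicit predictive intent, whether GUI agents can reach an exact state, and whether LLM-generated rewards can be deployed only at a suitable training stage.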

Hard Evidence for Data and Workflow Agents

16 minute read

Published: May 01, 2026

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

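Why Agents Need an Explicit Middle Layer

3 minute read

Published: May 01, 2026

TL;DR: this round I keep my eye on one thing. The really interesting work does not pile the end-to-end system a bit higher; it makes the middle layer explicit: BEV tokens, latent CoT, concept manifolds, retrieval pages. Once a model has an inspectable intermediate representation, geometry, control, and evidence all become easier to align.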

Why Agents Need an Explicit Middle Layer

14 minute read

Published: May 01, 2026

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems; they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

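Giving Long-Horizon Agents Replayable Workspaces

4 minute read

Published: May 01, 2026

TL;DR: this round focuses on a more fundamental question: can the working process of a long-horizon agent be reproduced, restored, and audited? I picked four new papers from April 30 because each makes explicit a layer of long-horizon agents that is easily overlooked: Synthetic Computers constructs user-level workspaces, Crab saves sandbox operating-system state, Exploration Hacking tests whether a model can resist reinforcement learning by controlling its exploration, and COHERENCE turns interleaved image-text document understanding into a verifiable alignment task.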

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published: May 01, 2026

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.

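From Closed-Loop Feedback to Auditable Workflows

3 minute read

Published: April 30, 2026

Short TL;DR: this round I shift attention from “can an agent clear a benchmark” to a harder question: after training and the workflow have ended, can the process still be verified? The four papers offer four entry points: FutureWorld uses the outcomes of real-world events, once they resolve, as delayed rewards; AgentSim generates verifiable RAG agent traces; DV-World puts data-visualization agents into native software, cross-framework evolution, and interactive clarification workflows; MoRFI looks at the mechanistic level at why introducing new facts in post-training can damage a model's access to old knowledge.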

Closed Loops for Auditable Agents

17 minute read

Published: April 30, 2026

TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.

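Giving Long-Horizon Agents Verifiable State

4 minute read

Published: April 30, 2026

Short TL;DR: this round focuses on a very practical shift: stronger agents are not just longer-context models but systems that can generate tasks, verify intermediate state, preserve evidence, and learn from changes in world state. I picked four recent open-access papers: ClawGym builds executable, verifiable computer-use tasks; World2VLM turns world-model imagination into a training signal; DataPRM brings the data-analysis reward model into the environment to check steps; OCR-Memory stores long-horizon history as optically retrievable evidence.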

Verifiable State for Long-Horizon Agents

16 minute read

Published: April 30, 2026

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.

How to Ensure the Consistency of Estimators When Dealing with Systematic Errors in Continuous Variable Systems

7 minute read

Published: November 30, 2025

TL;DR: Not all measurement error is fatal in panel data. In a Two-Way Fixed Effects (TWFE) model:
Constant and Time-Specific additive errors are harmless (absorbed by FEs).
Unit-Specific additive errors create bias when interacting a continuous regressor with a post-treatment dummy (Continuous DiD).
The Fix: You can eliminate this bias by including Unit × Post fixed effects or by using post-period centering (FWL residualization).
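
The claim above can be checked with a small simulation. What follows is a minimal sketch under an assumed data-generating process, not the post's own code: the panel size, the coefficients `beta` and `theta`, and the helper names `simulate`, `dummies`, and `interaction_coef` are all hypothetical. It covers only the Unit × Post fixed-effects fix, not post-period centering. The regressor is observed with a unit-specific additive error, and the interaction coefficient is estimated once with plain TWFE dummies and once with Unit × Post dummies added.

```python
import numpy as np

def simulate(n_units=300, n_periods=6, beta=1.0, theta=0.5, seed=0):
    """Assumed DGP: y = alpha_i + gamma_t + theta*x + beta*x*post + noise,
    where x is observed with a unit-specific additive error u_i."""
    rng = np.random.default_rng(seed)
    unit = np.repeat(np.arange(n_units), n_periods)
    time = np.tile(np.arange(n_periods), n_units)
    post = (time >= n_periods // 2).astype(float)
    x_true = rng.normal(size=unit.size)           # time-varying true regressor
    u = rng.normal(size=n_units)[unit]            # unit-specific measurement error
    alpha = rng.normal(size=n_units)[unit]        # unit fixed effects
    gamma = rng.normal(size=n_periods)[time]      # time fixed effects
    noise = rng.normal(scale=0.5, size=unit.size)
    y = alpha + gamma + theta * x_true + beta * x_true * post + noise
    return unit, time, post, x_true + u, y

def dummies(codes):
    """One indicator column per distinct code."""
    return (codes[:, None] == np.unique(codes)[None, :]).astype(float)

def interaction_coef(y, x_obs, post, fixed_effects):
    """OLS coefficient on x_obs * post, controlling for x_obs and the given dummies.
    lstsq copes with the collinear dummy columns; the interaction coefficient
    stays identified because x_obs varies within every dummy cell."""
    X = np.column_stack([x_obs * post, x_obs] + fixed_effects)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0]

unit, time, post, x_obs, y = simulate()
twfe = [dummies(unit), dummies(time)]

# Plain TWFE: the term u_i * post is left in the error and biases the estimate.
naive = interaction_coef(y, x_obs, post, twfe)

# Adding Unit x Post dummies absorbs u_i * post.
unit_post = dummies(unit * 2 + post.astype(int))
fixed = interaction_coef(y, x_obs, post, twfe + [unit_post])

print(f"true beta = 1.0, TWFE only = {naive:.3f}, with Unit x Post FE = {fixed:.3f}")
```

In this setup the regressor has to vary over time within each unit. With a time-invariant regressor, the Unit × Post dummies would absorb the interaction itself and leave nothing to estimate.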

从编译式智能体到选择性记忆与深度研究推导债

3 minute read

Published: May 21, 2026

TL;DR：本期看的不是“模型又会了什么”，而是智能体周围的运行时怎么变得更可控。Agent JIT 把网页智能体任务编译成带状态不变量和延迟调度的可执行计划。Mem-pi 把记忆做成一个会选择是否发声的生成式策略。DeepWeb-Bench 则提醒我们，深度研究智能体的主要错误往往不在“找不到资料”，而在推导、校准和跨来源协调。

Compiled Agents, Adaptive Memory, and Derivation Debt

16 minute read

Published: May 21, 2026

TL;DR: this round is about the runtime around an agent, not only the agent model. Agent JIT treats web-agent plans as compilable code with state invariants and latency-aware scheduling. Mem-pi turns memory into a small generative policy that can decide not to speak. DeepWeb-Bench shows that deep-research agents often find sources but still fail at derivation, calibration, and cross-source reconciliation.

从可执行软件世界到物理结构化世界模型

5 minute read

Published: May 20, 2026

TL;DR：本期关注的是智能体训练和评测里的“世界”到底有多可靠。EnvFactory 自动合成可执行、可验证的工具环境，用来训练 tool-use agents；OpenComputer 把桌面软件任务做成可以读取真实应用状态的 verifier-grounded software worlds；PH-Dreamer 则把 Port-Hamiltonian 物理结构放进视觉世界模型，让 latent imagination 不只追奖励，也要更接近能量和运动约束。

Executable Worlds for Agents, Physical Priors for World Models

15 minute read

Published: May 20, 2026

TL;DR: this round is about making the world around an agent less imaginary. EnvFactory synthesizes executable tool environments for agentic RL instead of training only on over-specified traces. OpenComputer builds verifier-grounded desktop software worlds where computer-use agents can be scored by application state, not just screenshots or LLM judgment. PH-Dreamer puts a Port-Hamiltonian structure inside a visual world model, so latent imagination is pressured to respect energy and smoothness instead of only predicting reward.

先探索再行动，先组证据再作答

4 minute read

Published: May 18, 2026

TL;DR：本期我关注的是智能体和世界模型在作答或行动之前应该先做什么。一篇论文把“探索陌生环境”做成可训练能力，而不是指望任务奖励顺便学出来；一篇论文把 deep research 从多条搜索轨迹的投票，改成共享证据图的组装；一篇论文则追问视频 latent prediction 到底有没有学到更像世界模型的表示，而不只看干净分类准确率。

Exploration Before Action, Evidence Before Answers

15 minute read

Published: May 18, 2026

TL;DR: this round is about agents and world models that should not rush straight to an answer. One paper trains LLM agents to explore an unfamiliar environment before acting. One turns deep research into a shared evidence graph rather than a pile of independent search traces. One asks whether latent video prediction produces world-model representations that survive corruption, occlusion, contact ambiguity, and reversed time.

从路径选择到路由几何

4 minute read

Published: May 13, 2026

TL;DR：本期我关注的不是“智能体有没有中间状态”，而是这些中间结构怎样改变决策本身。ToolCUA 研究电脑使用智能体什么时候该继续点 GUI、什么时候该切到工具；DataMaster 把数据工程变成带分支、记忆和反馈的搜索；PriorZero 试图把语言模型先验接入世界模型规划，但只放在 MCTS 根节点；MoE 路由论文则从机理上解释 router 和 expert 为什么会学出共享几何，以及常见 load-balancing loss 可能怎样破坏这种几何。

Path Choices, Data Search, and Routing Geometry

15 minute read

Published: May 13, 2026

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

在智能体失败之前读懂执行轨迹

6 minute read

Published: May 11, 2026

TL;DR：本期看的是智能体轨迹如何成为训练和诊断对象。A3 用 shell 命令结构给 CLI 智能体做逐步 credit assignment，而不是只看整条轨迹成败。Tool Calling 论文发现，模型内部的工具选择可以线性读出和干预，错误工具调用在 JSON 输出前就有信号。MASPrism 则用小模型 prefill 阶段的 NLL 和 attention，在长多智能体日志里定位可能的失败源头，而且不需要生成诊断长文。

Reading Agent Traces Before They Become Failures

17 minute read

Published: May 11, 2026

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

在行动之前读懂隐藏状态

5 minute read

Published: May 10, 2026

简短 TL;DR：本期关注的是模型或智能体在行动之前到底“看见”了什么。Natural Language Autoencoders 把残差流激活翻译成自然语言，给模型审计提供一个可读入口。BAMI 诊断 GUI grounding 的坐标偏差，并在测试时纠正一部分错误点击。Sheet as Token 把多表格工作簿压缩成可检索的 sheet 级对象，让数据智能体不必一上来就吞整本 Excel。

Reading Hidden State Before Agents Act

17 minute read

Published: May 10, 2026

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

让长程智能体会规划、会分工、也会验引文

4 minute read

Published: May 09, 2026

TL;DR：本期看长程智能体如何摆脱单一 reactive loop。StraTA 让 agent 在执行前先生成全局策略，并把后续动作都放在这个策略下训练；RAO 把递归 subagent 变成可训练的推理时扩展机制；引文归因评测论文则检查 deep research 报告里的引用是否真的支撑旁边那句话。

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published: May 09, 2026

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

技能、检索与记忆化世界模型

4 minute read

Published: May 08, 2026

TL;DR：本期看的是智能体工作流里的“可操作状态”。SkillOS 让智能体从经验里学会维护技能库；SIRA 把多轮搜索压缩成一次有语料意识的词法检索动作；HaM-World 用选择性记忆和几何结构稳定规划用的世界模型潜变量。

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published: May 08, 2026

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

让中间产物能被智能体真正使用

4 minute read

Published: May 06, 2026

TL;DR：本期看的是“中间产物”能不能被下游系统真正消费。TraceLift 讨论推理轨迹是否应该按它对冻结执行器的帮助来奖励；BRIGHT-Pro 讨论检索器是否应该覆盖一组互补证据，而不是只找一个相关段落；Agentic-imodels 则把可解释模型重新定义为“智能体读得懂、能模拟”的工具。

Intermediate Work That Agents Can Actually Use

13 minute read

Published: May 06, 2026

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

从搜索轨迹到记忆电路，再到世界理论

5 minute read

Published: May 06, 2026

TL;DR：本期我想看的是智能体给出答案之前已经学到的结构。OpenSeeker-v2 讨论高质量、长难度搜索轨迹能把纯 SFT 搜索智能体推到什么程度；agent memory circuit 论文把写入、管理、读取三个记忆阶段拆开做电路追踪；Learning-to-Theorize 则把 world model 从“预测下一帧”推向“从观察中归纳可执行理论”。

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published: May 06, 2026

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

先看清证据，再让模型回答

6 minute read

Published: May 05, 2026

TL;DR：本期没有退回 4 月 30 日，而是保留 5 月 3-4 日的新论文，主题是“模型在回答或生成之前应该先看什么”。FlexSQL 让 data agent 在推理过程中反复检查 schema、取值、执行结果和计划，而不是一次性把 schema retrieval 固定下来。Chart-FR1 把密集图表推理训练成显式视觉聚焦过程，让 reasoning step 绑定 OCR 文本和局部区域。PV-VAE 则把 video VAE 从纯重建改成预测式重建，迫使 latent 携带运动和未来变化信息。

Agents That Look Before They Answer

18 minute read

Published: May 05, 2026

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

把证据从提示词里移出来

6 minute read

Published: May 03, 2026

TL;DR：5 月 1 日到 3 日目标主题里的 arXiv 新稿很少，所以本期在去重后扩展到 4 月 30 日最新窗口。我选了三篇都在把证据移出 prompt 的论文：Claw-Eval-Live 用动态需求信号刷新 workflow-agent 评测，并按可观察动作打分；ObjectGraph 把文档变成可遍历的有类型图，而不是整段注入上下文；CIRM 在推理时干预 reward model 激活，降低格式捷径变成训练标签的风险。

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published: May 03, 2026

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

在漂移变成答案之前加检查

5 minute read

Published: May 03, 2026

TL;DR：这一期我关注的是系统在“看起来已经完成”之前，能不能先发现或修正漂移。5 月 1 日到 3 日目标方向里没有足够新的 arXiv CS 投稿，所以我扩展到 4 月 30 日最新窗口，选了四篇开放全文：PRISM 讨论多模态模型在 RLVR 前的预对齐，PhyCo 讨论视频生成里的物理属性控制，FCMBench-Video 讨论文档证据随时间展开时的评测，Latent Adversarial Detection 讨论多轮攻击意图在激活轨迹中的信号。

Checks Before Agents Drift

20 minute read

Published: May 03, 2026

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

让智能体有可操作的结构化表面

5 minute read

Published: May 02, 2026

TL;DR：这一期我关注的不是“再给智能体更多上下文”或“每一步都换成更强模型”，而是智能体工作时需要什么样的结构化操作表面。入选的四篇分别讨论：用共享工作板做大规模 web-to-table 数据抽取、用事件检测决定电脑使用智能体何时升级到大模型、用 trace 分析多智能体系统中的信息污染，以及把 reward model 的机理分析目标从词表投影换成 reward head 方向。

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published: May 02, 2026

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

给科学研究智能体更好的结构化接口

4 minute read

Published: May 02, 2026

TL;DR：这期我看的不是“智能体又多会做题了”，而是科学研究智能体到底需要什么接口。四篇论文分别把方法演化图谱、领域 foundation model 调用、复杂文献发现、科学可视化 workflow 做成了更明确的结构。我的判断是：研究智能体真正缺的往往不是更长上下文，而是能被查询、调用、验证和回放的中间层。

Structured Interfaces for Scientific Agents

18 minute read

Published: May 02, 2026

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

让数据与界面智能体面对硬证据

3 minute read

Published: May 01, 2026

TL;DR：这期我看的是“证据对象变硬”这件事。四篇论文都不满足于让智能体写一个像样的答案，而是继续追问：科学数据任务能不能被可执行 verifier 检查，表格问答能不能识别隐含预测意图，GUI 智能体能不能到达精确状态，LLM 生成的 reward 能不能在合适的训练阶段再部署。

Hard Evidence for Data and Workflow Agents

16 minute read

Published: May 01, 2026

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

为什么智能体需要显式中间层

3 minute read

Published: May 01, 2026

TL;DR：这期我盯着同一件事。真正有意思的工作，不是把端到端系统再堆大一点，而是把中间层做显式：BEV token、latent CoT、概念流形、检索页。模型一旦有了可检查的中间表示，几何、控制和证据就都更容易对齐。

Why Agents Need an Explicit Middle Layer

14 minute read

Published: May 01, 2026

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems, they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

让长程智能体拥有可回放工作区

4 minute read

Published: May 01, 2026

TL;DR： 本期关注一个更底层的问题：长程智能体的工作过程能不能被复现、恢复和审计。我选了四篇 4 月 30 日的新论文，因为它们分别把长程智能体里容易被忽略的层显式化：Synthetic Computers 构造用户级工作区，Crab 保存沙箱操作系统状态，Exploration Hacking 检验模型是否能通过控制探索来抵抗强化学习，COHERENCE 把图文交错文档理解变成可核验的对齐任务。

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published: May 01, 2026

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.

从闭环反馈到可审计工作流

3 minute read

Published: April 30, 2026

简短 TL;DR： 本期我把注意力从“智能体能不能刷过一个 benchmark”转到更难的问题：训练和工作流结束后，过程还能不能被核验。四篇论文分别给出四个入口：FutureWorld 用真实世界事件兑现后的结果做延迟奖励；AgentSim 生成可核验的 RAG 智能体轨迹；DV-World 把数据可视化智能体放进原生软件、跨框架演化和交互澄清流程；MoRFI 则从机制层看，后训练引入新事实时为什么可能损伤模型对旧知识的访问。

Closed Loops for Auditable Agents

17 minute read

Published: April 30, 2026

TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.

让长程智能体拥有可核验的状态

4 minute read

Published: April 30, 2026

简短 TL;DR： 本期关注一个很实用的变化：更强的智能体不只是更长上下文模型，而是能生成任务、核验中间状态、保存证据、并从世界状态变化中学习的系统。我选了四篇近期开放全文论文：ClawGym 做可执行、可核验的电脑使用任务；World2VLM 把世界模型想象变成训练信号；DataPRM 让数据分析奖励模型进入环境检查步骤；OCR-Memory 则把长程历史保存成可光学检索的证据。

Verifiable State for Long-Horizon Agents

16 minute read

Published: April 30, 2026

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.

How to Ensure the Consistency of Estimators When Dealing with Systematic Errors in Continuous Variable Systems

7 minute read

Published: November 30, 2025

TL;DR: Not all measurement error is fatal in panel data. In a Two-Way Fixed Effects (TWFE) model:
Constant and Time-Specific additive errors are harmless (absorbed by FEs).
Unit-Specific additive errors create bias when interacting a continuous regressor with a post-treatment dummy (Continuous DiD).
The Fix: You can eliminate this bias by including Unit × Post fixed effects or by using post-period centering (FWL residualization).
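
The bias and the fix can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not taken from the post: a balanced panel, a time-varying regressor `x_true`, a hypothetical data-generating process with a true effect of 2.0, and made-up names (`fe_slope`, `unit_post`, `unit_err`). It partials out fixed effects with plain least squares on dummy matrices (the FWL route) and compares the TWFE slope with and without Unit × Post fixed effects.

```python
import numpy as np

def dummies(codes):
    """One-hot matrix for an integer-coded grouping variable."""
    codes = np.asarray(codes)
    return (codes[:, None] == np.unique(codes)[None, :]).astype(float)

def fe_slope(y, z, *groupings):
    """Slope of y on z after partialling out the given fixed effects (FWL)."""
    D = np.hstack([dummies(g) for g in groupings])
    # lstsq tolerates the collinear dummy columns; residuals are still the projection residuals
    resid = lambda v: v - D @ np.linalg.lstsq(D, v, rcond=None)[0]
    y_r, z_r = resid(y), resid(z)
    return float(z_r @ y_r / (z_r @ z_r))

# Hypothetical balanced panel: N units, T periods, treatment starts at T // 2
rng = np.random.default_rng(0)
N, T, beta = 200, 8, 2.0
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)
post = (time >= T // 2).astype(float)

x_true = rng.normal(size=N * T)                   # time-varying continuous regressor
unit_err = rng.normal(scale=2.0, size=N)[unit]    # unit-specific additive measurement error
x_obs = x_true + unit_err                         # what the analyst actually observes

y = (rng.normal(size=N)[unit]                     # unit fixed effects
     + rng.normal(size=T)[time]                   # time fixed effects
     + beta * x_true * post                       # continuous DiD effect
     + rng.normal(scale=0.5, size=N * T))         # idiosyncratic noise

z = x_obs * post                                  # mismeasured interaction term

naive = fe_slope(y, z, unit, time)                # plain TWFE: attenuated
unit_post = unit * 2 + post.astype(int)           # one cell per (unit, pre/post)
fixed = fe_slope(y, z, unit, time, unit_post)     # adds Unit x Post fixed effects

print(f"true beta = {beta}, naive TWFE = {naive:.2f}, with Unit x Post FE = {fixed:.2f}")
```

In this setup the error enters the interaction as `unit_err * post`, which neither unit nor time effects absorb, so the naive slope is pulled toward zero. The Unit × Post cells absorb that term exactly, which is the same thing post-period centering does by subtracting each unit's post-period mean of the observed regressor.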


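From Compiled Agents to Selective Memory and Deep-Research Derivation Debt

3 minute read

Published: May 21, 2026

TL;DR: This round is not about “what else the model has learned to do” but about how the runtime around an agent becomes more controllable. Agent JIT compiles web-agent tasks into executable plans with state invariants and latency-aware scheduling. Mem-pi makes memory a generative policy that chooses whether to speak. DeepWeb-Bench reminds us that the main errors of deep-research agents often lie not in “failing to find sources” but in derivation, calibration, and cross-source reconciliation.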

Compiled Agents, Adaptive Memory, and Derivation Debt

16 minute read

Published: May 21, 2026

TL;DR: this round is about the runtime around an agent, not only the agent model. Agent JIT treats web-agent plans as compilable code with state invariants and latency-aware scheduling. Mem-pi turns memory into a small generative policy that can decide not to speak. DeepWeb-Bench shows that deep-research agents often find sources but still fail at derivation, calibration, and cross-source reconciliation.

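From Executable Software Worlds to Physically Structured World Models

5 minute read

Published: May 20, 2026

TL;DR: This round asks how reliable the “world” in agent training and evaluation really is. EnvFactory automatically synthesizes executable, verifiable tool environments for training tool-use agents; OpenComputer turns desktop software tasks into verifier-grounded software worlds that can read real application state; and PH-Dreamer puts Port-Hamiltonian physical structure into a visual world model, so that latent imagination does not only chase reward but also stays closer to energy and motion constraints.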

Executable Worlds for Agents, Physical Priors for World Models

15 minute read

Published: May 20, 2026

TL;DR: this round is about making the world around an agent less imaginary. EnvFactory synthesizes executable tool environments for agentic RL instead of training only on over-specified traces. OpenComputer builds verifier-grounded desktop software worlds where computer-use agents can be scored by application state, not just screenshots or LLM judgment. PH-Dreamer puts a Port-Hamiltonian structure inside a visual world model, so latent imagination is pressured to respect energy and smoothness instead of only predicting reward.

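Explore Before Acting, Assemble Evidence Before Answering

4 minute read

Published: May 18, 2026

TL;DR: This round I look at what agents and world models should do before they answer or act. One paper makes “exploring an unfamiliar environment” a trainable capability rather than hoping it emerges as a by-product of task reward; one changes deep research from voting over multiple search trajectories to assembling a shared evidence graph; and one asks whether video latent prediction has actually learned representations that look more like a world model, rather than looking only at clean classification accuracy.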

Exploration Before Action, Evidence Before Answers

15 minute read

Published: May 18, 2026

TL;DR: this round is about agents and world models that should not rush straight to an answer. One paper trains LLM agents to explore an unfamiliar environment before acting. One turns deep research into a shared evidence graph rather than a pile of independent search traces. One asks whether latent video prediction produces world-model representations that survive corruption, occlusion, contact ambiguity, and reversed time.

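From Path Choices to Routing Geometry

4 minute read

Published: May 13, 2026

TL;DR: This round I care less about “whether agents have intermediate state” and more about how those intermediate structures change the decision itself. ToolCUA studies when a computer-use agent should keep clicking through the GUI and when it should switch to a tool; DataMaster turns data engineering into a search with branches, memory, and feedback; PriorZero tries to plug language-model priors into world-model planning, but only at the MCTS root node; and the MoE routing paper explains mechanistically why routers and experts learn a shared geometry, and how common load-balancing losses may break it.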

Path Choices, Data Search, and Routing Geometry

15 minute read

Published: May 13, 2026

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

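Reading Execution Traces Before Agents Fail

6 minute read

Published: May 11, 2026

TL;DR: This round looks at how agent trajectories become objects of training and diagnosis. A3 uses the structure of shell commands to do step-level credit assignment for CLI agents, rather than looking only at whether the whole trajectory succeeded. The Tool Calling paper finds that a model's internal tool choice can be read out and intervened on linearly, and that a wrong tool call shows a signal before the JSON is emitted. MASPrism uses NLL and attention from a small model's prefill stage to locate likely failure sources in long multi-agent logs, without generating a long diagnostic write-up.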

Reading Agent Traces Before They Become Failures

17 minute read

Published: May 11, 2026

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

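Reading Hidden State Before Acting

5 minute read

Published: May 10, 2026

Short TL;DR: This round asks what a model or agent has actually “seen” before it acts. Natural Language Autoencoders translate residual-stream activations into natural language, giving model auditing a readable entry point. BAMI diagnoses coordinate bias in GUI grounding and corrects some of the wrong clicks at test time. Sheet as Token compresses multi-sheet workbooks into retrievable sheet-level objects, so that data agents do not have to swallow an entire Excel file up front.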

Reading Hidden State Before Agents Act

17 minute read

Published: May 10, 2026

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

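Teaching Long-Horizon Agents to Plan, Delegate, and Check Citations

4 minute read

Published: May 09, 2026

TL;DR: This round looks at how long-horizon agents break out of a single reactive loop. StraTA has the agent generate a global strategy before execution and trains all subsequent actions under that strategy; RAO turns recursive subagents into a trainable inference-time scaling mechanism; and the citation-attribution evaluation paper checks whether the citations in deep-research reports actually support the sentence next to them.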

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published: May 09, 2026

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

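Skills, Retrieval, and Memory-Augmented World Models

4 minute read

Published: May 08, 2026

TL;DR: This round looks at “actionable state” in agent workflows. SkillOS lets an agent learn from experience to maintain a skill library; SIRA compresses multi-round search into a single corpus-aware lexical retrieval action; and HaM-World uses selective memory and geometric structure to stabilize the world-model latents used for planning.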

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published: May 08, 2026

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

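Making Intermediate Artifacts Truly Usable by Agents

4 minute read

Published: May 06, 2026

TL;DR: This round asks whether “intermediate artifacts” can actually be consumed by downstream systems. TraceLift discusses whether a reasoning trace should be rewarded according to how much it helps a frozen executor; BRIGHT-Pro discusses whether a retriever should cover a set of complementary evidence rather than find only one relevant passage; and Agentic-imodels redefines an interpretable model as a tool that “an agent can read and simulate.”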

Intermediate Work That Agents Can Actually Use

13 minute read

Published: May 06, 2026

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

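From Search Trajectories to Memory Circuits to Theories of the World

5 minute read

Published: May 06, 2026

TL;DR: This round I want to look at the structure an agent has already learned before it gives an answer. OpenSeeker-v2 discusses how far high-quality, long, and difficult search trajectories can push a pure-SFT search agent; the agent memory circuit paper separates the three memory stages of write, manage, and read for circuit tracing; and Learning-to-Theorize pushes world models from “predicting the next frame” toward “inducing executable theories from observations.”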

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published: May 06, 2026

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

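See the Evidence Clearly Before Letting the Model Answer

6 minute read

Published: May 05, 2026

TL;DR: This round does not fall back to April 30 but stays with new papers from May 3-4; the theme is “what a model should look at before it answers or generates.” FlexSQL lets a data agent repeatedly inspect schemas, values, execution results, and plans during reasoning, rather than fixing schema retrieval once up front. Chart-FR1 trains dense chart reasoning as an explicit visual-focus process, binding each reasoning step to OCR text and local regions. PV-VAE changes the video VAE from pure reconstruction to predictive reconstruction, forcing the latent to carry information about motion and future change.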

Agents That Look Before They Answer

18 minute read

Published: May 05, 2026

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

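Moving Evidence Out of the Prompt

6 minute read

Published: May 03, 2026

TL;DR: There were few new arXiv submissions on the target topics from May 1 to 3, so after deduplication this round expands to the latest April 30 window. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation with live demand signals and scores by observable actions; ObjectGraph turns documents into traversable typed graphs rather than injecting them whole into the context; and CIRM intervenes on reward-model activations at inference time, reducing the risk that formatting shortcuts become training labels.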

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published: May 03, 2026

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

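Adding Checks Before Drift Becomes the Answer

5 minute read

Published: May 03, 2026

TL;DR: This round I focus on whether a system can detect or correct drift before it “looks finished.” There were not enough fresh arXiv CS submissions in the target areas from May 1 to 3, so I expanded to the latest April 30 window and picked four open-access papers: PRISM discusses pre-alignment of multimodal models before RLVR, PhyCo discusses control of physical properties in video generation, FCMBench-Video discusses evaluation when document evidence unfolds over time, and Latent Adversarial Detection discusses the signal of multi-turn attack intent in activation trajectories.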

Checks Before Agents Drift

20 minute read

Published: May 03, 2026

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

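Giving Agents Structured Surfaces to Operate On

5 minute read

Published: May 02, 2026

TL;DR: This round I focus not on “giving the agent even more context” or “swapping in a stronger model at every step” but on what kind of structured operating surface an agent needs while it works. The four selected papers discuss: large-scale web-to-table data extraction with a shared workboard, using event detection to decide when a computer-use agent should escalate to a large model, using traces to analyze information contamination in multi-agent systems, and switching the target of reward-model mechanistic analysis from vocabulary projection to the reward-head direction.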

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published: May 02, 2026

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

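Better Structured Interfaces for Scientific Research Agents

4 minute read

Published: May 02, 2026

TL;DR: This round I look not at “agents getting better at solving problems” but at what interfaces scientific research agents actually need. The four papers turn method-evolution graphs, calls to domain foundation models, complex literature discovery, and scientific-visualization workflows into more explicit structures. My read is that what research agents really lack is often not longer context but a middle layer that can be queried, called, verified, and replayed.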

Structured Interfaces for Scientific Agents

18 minute read

Published: May 02, 2026

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

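Making Data and Interface Agents Face Hard Evidence

3 minute read

Published: May 01, 2026

TL;DR: This round I look at “evidence objects getting harder.” None of the four papers is satisfied with having an agent write a plausible answer; they go on to ask whether a scientific data task can be checked by an executable verifier, whether table question answering can recognize implicit predictive intent, whether a GUI agent can reach an exact state, and whether an LLM-generated reward can be held back and deployed only at the appropriate training stage.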

Hard Evidence for Data and Workflow Agents

16 minute read

Published: May 01, 2026

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

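Why Agents Need an Explicit Middle Layer

3 minute read

Published: May 01, 2026

TL;DR: This round I keep my eye on one thing. The truly interesting work does not pile the end-to-end system a bit higher; it makes the middle layer explicit: BEV tokens, latent CoT, concept manifolds, retrieval pages. Once a model has an inspectable intermediate representation, geometry, control, and evidence all become easier to align.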

Why Agents Need an Explicit Middle Layer

14 minute read

Published: May 01, 2026

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems; they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

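Giving Long-Horizon Agents Replayable Workspaces

4 minute read

Published: May 01, 2026

TL;DR: This round focuses on a lower-level question: can the working process of a long-horizon agent be reproduced, recovered, and audited? I picked four new papers from April 30 because each makes explicit a layer of long-horizon agents that is easy to overlook: Synthetic Computers constructs user-level workspaces, Crab saves the operating-system state of sandboxes, Exploration Hacking tests whether a model can resist reinforcement learning by controlling its exploration, and COHERENCE turns interleaved image-text document understanding into a verifiable alignment task.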

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published: May 01, 2026

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.


从编译式智能体到选择性记忆与深度研究推导债

3 minute read

Published: May 21, 2026

TL;DR：本期看的不是“模型又会了什么”，而是智能体周围的运行时怎么变得更可控。Agent JIT 把网页智能体任务编译成带状态不变量和延迟调度的可执行计划。Mem-pi 把记忆做成一个会选择是否发声的生成式策略。DeepWeb-Bench 则提醒我们，深度研究智能体的主要错误往往不在“找不到资料”，而在推导、校准和跨来源协调。

Compiled Agents, Adaptive Memory, and Derivation Debt

16 minute read

Published: May 21, 2026

TL;DR: this round is about the runtime around an agent, not only the agent model. Agent JIT treats web-agent plans as compilable code with state invariants and latency-aware scheduling. Mem-pi turns memory into a small generative policy that can decide not to speak. DeepWeb-Bench shows that deep-research agents often find sources but still fail at derivation, calibration, and cross-source reconciliation.

从可执行软件世界到物理结构化世界模型

5 minute read

Published: May 20, 2026

TL;DR：本期关注的是智能体训练和评测里的“世界”到底有多可靠。EnvFactory 自动合成可执行、可验证的工具环境，用来训练 tool-use agents；OpenComputer 把桌面软件任务做成可以读取真实应用状态的 verifier-grounded software worlds；PH-Dreamer 则把 Port-Hamiltonian 物理结构放进视觉世界模型，让 latent imagination 不只追奖励，也要更接近能量和运动约束。

Executable Worlds for Agents, Physical Priors for World Models

15 minute read

Published: May 20, 2026

TL;DR: this round is about making the world around an agent less imaginary. EnvFactory synthesizes executable tool environments for agentic RL instead of training only on over-specified traces. OpenComputer builds verifier-grounded desktop software worlds where computer-use agents can be scored by application state, not just screenshots or LLM judgment. PH-Dreamer puts a Port-Hamiltonian structure inside a visual world model, so latent imagination is pressured to respect energy and smoothness instead of only predicting reward.

先探索再行动，先组证据再作答

4 minute read

Published: May 18, 2026

TL;DR：本期我关注的是智能体和世界模型在作答或行动之前应该先做什么。一篇论文把“探索陌生环境”做成可训练能力，而不是指望任务奖励顺便学出来；一篇论文把 deep research 从多条搜索轨迹的投票，改成共享证据图的组装；一篇论文则追问视频 latent prediction 到底有没有学到更像世界模型的表示，而不只看干净分类准确率。

Exploration Before Action, Evidence Before Answers

15 minute read

Published: May 18, 2026

TL;DR: this round is about agents and world models that should not rush straight to an answer. One paper trains LLM agents to explore an unfamiliar environment before acting. One turns deep research into a shared evidence graph rather than a pile of independent search traces. One asks whether latent video prediction produces world-model representations that survive corruption, occlusion, contact ambiguity, and reversed time.

从路径选择到路由几何

4 minute read

Published: May 13, 2026

TL;DR：本期我关注的不是“智能体有没有中间状态”，而是这些中间结构怎样改变决策本身。ToolCUA 研究电脑使用智能体什么时候该继续点 GUI、什么时候该切到工具；DataMaster 把数据工程变成带分支、记忆和反馈的搜索；PriorZero 试图把语言模型先验接入世界模型规划，但只放在 MCTS 根节点；MoE 路由论文则从机理上解释 router 和 expert 为什么会学出共享几何，以及常见 load-balancing loss 可能怎样破坏这种几何。

Path Choices, Data Search, and Routing Geometry

15 minute read

Published: May 13, 2026

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

在智能体失败之前读懂执行轨迹

6 minute read

Published: May 11, 2026

TL;DR：本期看的是智能体轨迹如何成为训练和诊断对象。A3 用 shell 命令结构给 CLI 智能体做逐步 credit assignment，而不是只看整条轨迹成败。Tool Calling 论文发现，模型内部的工具选择可以线性读出和干预，错误工具调用在 JSON 输出前就有信号。MASPrism 则用小模型 prefill 阶段的 NLL 和 attention，在长多智能体日志里定位可能的失败源头，而且不需要生成诊断长文。

Reading Agent Traces Before They Become Failures

17 minute read

Published: May 11, 2026

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

在行动之前读懂隐藏状态

5 minute read

Published: May 10, 2026

简短 TL;DR：本期关注的是模型或智能体在行动之前到底“看见”了什么。Natural Language Autoencoders 把残差流激活翻译成自然语言，给模型审计提供一个可读入口。BAMI 诊断 GUI grounding 的坐标偏差，并在测试时纠正一部分错误点击。Sheet as Token 把多表格工作簿压缩成可检索的 sheet 级对象，让数据智能体不必一上来就吞整本 Excel。

Reading Hidden State Before Agents Act

17 minute read

Published: May 10, 2026

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

让长程智能体会规划、会分工、也会验引文

4 minute read

Published: May 09, 2026

TL;DR：本期看长程智能体如何摆脱单一 reactive loop。StraTA 让 agent 在执行前先生成全局策略，并把后续动作都放在这个策略下训练；RAO 把递归 subagent 变成可训练的推理时扩展机制；引文归因评测论文则检查 deep research 报告里的引用是否真的支撑旁边那句话。

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published: May 09, 2026

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

技能、检索与记忆化世界模型

4 minute read

Published: May 08, 2026

TL;DR：本期看的是智能体工作流里的“可操作状态”。SkillOS 让智能体从经验里学会维护技能库；SIRA 把多轮搜索压缩成一次有语料意识的词法检索动作；HaM-World 用选择性记忆和几何结构稳定规划用的世界模型潜变量。

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published: May 08, 2026

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

让中间产物能被智能体真正使用

4 minute read

Published: May 06, 2026

TL;DR：本期看的是“中间产物”能不能被下游系统真正消费。TraceLift 讨论推理轨迹是否应该按它对冻结执行器的帮助来奖励；BRIGHT-Pro 讨论检索器是否应该覆盖一组互补证据，而不是只找一个相关段落；Agentic-imodels 则把可解释模型重新定义为“智能体读得懂、能模拟”的工具。

Intermediate Work That Agents Can Actually Use

13 minute read

Published: May 06, 2026

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

从搜索轨迹到记忆电路，再到世界理论

5 minute read

Published: May 06, 2026

TL;DR：本期我想看的是智能体给出答案之前已经学到的结构。OpenSeeker-v2 讨论高质量、长难度搜索轨迹能把纯 SFT 搜索智能体推到什么程度；agent memory circuit 论文把写入、管理、读取三个记忆阶段拆开做电路追踪；Learning-to-Theorize 则把 world model 从“预测下一帧”推向“从观察中归纳可执行理论”。

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published: May 06, 2026

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

先看清证据，再让模型回答

6 minute read

Published: May 05, 2026

TL;DR：本期没有退回 4 月 30 日，而是保留 5 月 3-4 日的新论文，主题是“模型在回答或生成之前应该先看什么”。FlexSQL 让 data agent 在推理过程中反复检查 schema、取值、执行结果和计划，而不是一次性把 schema retrieval 固定下来。Chart-FR1 把密集图表推理训练成显式视觉聚焦过程，让 reasoning step 绑定 OCR 文本和局部区域。PV-VAE 则把 video VAE 从纯重建改成预测式重建，迫使 latent 携带运动和未来变化信息。

Agents That Look Before They Answer

18 minute read

Published: May 05, 2026

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

把证据从提示词里移出来

6 minute read

Published: May 03, 2026

TL;DR：5 月 1 日到 3 日目标主题里的 arXiv 新稿很少，所以本期在去重后扩展到 4 月 30 日最新窗口。我选了三篇都在把证据移出 prompt 的论文：Claw-Eval-Live 用动态需求信号刷新 workflow-agent 评测，并按可观察动作打分；ObjectGraph 把文档变成可遍历的有类型图，而不是整段注入上下文；CIRM 在推理时干预 reward model 激活，降低格式捷径变成训练标签的风险。

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published: May 03, 2026

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

在漂移变成答案之前加检查

5 minute read

Published: May 03, 2026

TL;DR：这一期我关注的是系统在“看起来已经完成”之前，能不能先发现或修正漂移。5 月 1 日到 3 日目标方向里没有足够新的 arXiv CS 投稿，所以我扩展到 4 月 30 日最新窗口，选了四篇开放全文：PRISM 讨论多模态模型在 RLVR 前的预对齐，PhyCo 讨论视频生成里的物理属性控制，FCMBench-Video 讨论文档证据随时间展开时的评测，Latent Adversarial Detection 讨论多轮攻击意图在激活轨迹中的信号。

Checks Before Agents Drift

20 minute read

Published: May 03, 2026

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

让智能体有可操作的结构化表面

5 minute read

Published: May 02, 2026

TL;DR：这一期我关注的不是“再给智能体更多上下文”或“每一步都换成更强模型”，而是智能体工作时需要什么样的结构化操作表面。入选的四篇分别讨论：用共享工作板做大规模 web-to-table 数据抽取、用事件检测决定电脑使用智能体何时升级到大模型、用 trace 分析多智能体系统中的信息污染，以及把 reward model 的机理分析目标从词表投影换成 reward head 方向。

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published: May 02, 2026

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

给科学研究智能体更好的结构化接口

4 minute read

Published: May 02, 2026

TL;DR：这期我看的不是“智能体又多会做题了”，而是科学研究智能体到底需要什么接口。四篇论文分别把方法演化图谱、领域 foundation model 调用、复杂文献发现、科学可视化 workflow 做成了更明确的结构。我的判断是：研究智能体真正缺的往往不是更长上下文，而是能被查询、调用、验证和回放的中间层。

Structured Interfaces for Scientific Agents

18 minute read

Published: May 02, 2026

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

让数据与界面智能体面对硬证据

3 minute read

Published: May 01, 2026

TL;DR：这期我看的是“证据对象变硬”这件事。四篇论文都不满足于让智能体写一个像样的答案，而是继续追问：科学数据任务能不能被可执行 verifier 检查，表格问答能不能识别隐含预测意图，GUI 智能体能不能到达精确状态，LLM 生成的 reward 能不能在合适的训练阶段再部署。

Hard Evidence for Data and Workflow Agents

16 minute read

Published: May 01, 2026

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

为什么智能体需要显式中间层

3 minute read

Published: May 01, 2026

TL;DR：这期我盯着同一件事。真正有意思的工作，不是把端到端系统再堆大一点，而是把中间层做显式：BEV token、latent CoT、概念流形、检索页。模型一旦有了可检查的中间表示，几何、控制和证据就都更容易对齐。

Why Agents Need an Explicit Middle Layer

14 minute read

Published: May 01, 2026

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems, they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

让长程智能体拥有可回放工作区

4 minute read

Published: May 01, 2026

TL;DR： 本期关注一个更底层的问题：长程智能体的工作过程能不能被复现、恢复和审计。我选了四篇 4 月 30 日的新论文，因为它们分别把长程智能体里容易被忽略的层显式化：Synthetic Computers 构造用户级工作区，Crab 保存沙箱操作系统状态，Exploration Hacking 检验模型是否能通过控制探索来抵抗强化学习，COHERENCE 把图文交错文档理解变成可核验的对齐任务。

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published: May 01, 2026

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.

从闭环反馈到可审计工作流

3 minute read

Published: April 30, 2026

简短 TL;DR： 本期我把注意力从“智能体能不能刷过一个 benchmark”转到更难的问题：训练和工作流结束后，过程还能不能被核验。四篇论文分别给出四个入口：FutureWorld 用真实世界事件兑现后的结果做延迟奖励；AgentSim 生成可核验的 RAG 智能体轨迹；DV-World 把数据可视化智能体放进原生软件、跨框架演化和交互澄清流程；MoRFI 则从机制层看，后训练引入新事实时为什么可能损伤模型对旧知识的访问。

Closed Loops for Auditable Agents

17 minute read

Published: April 30, 2026

TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.

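Giving Long-Horizon Agents Verifiable State

4 minute read

Published: April 30, 2026

Short TL;DR: This round looks at a very practical change: a stronger agent is not just a longer-context model, but a system that can generate tasks, verify intermediate state, preserve evidence, and learn from changes in world state. I picked four recent open full-text papers: ClawGym builds executable, verifiable computer-use tasks; World2VLM turns world-model imagination into a training signal; DataPRM lets a data-analysis reward model enter the environment to check steps; and OCR-Memory stores long-horizon history as optically retrievable evidence.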

Verifiable State for Long-Horizon Agents

16 minute read

Published: April 30, 2026

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.

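From Compiled Agents to Selective Memory and Deep-Research Derivation Debt

3 minute read

Published: May 21, 2026

TL;DR: This round is not about “what else the model can now do” but about how the runtime around an agent becomes more controllable. Agent JIT compiles web-agent tasks into executable plans with state invariants and latency scheduling. Mem-pi makes memory a generative policy that chooses whether to speak. DeepWeb-Bench reminds us that the main errors of deep-research agents often lie not in “failing to find sources” but in derivation, calibration, and cross-source reconciliation.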

Compiled Agents, Adaptive Memory, and Derivation Debt

16 minute read

Published: May 21, 2026

TL;DR: this round is about the runtime around an agent, not only the agent model. Agent JIT treats web-agent plans as compilable code with state invariants and latency-aware scheduling. Mem-pi turns memory into a small generative policy that can decide not to speak. DeepWeb-Bench shows that deep-research agents often find sources but still fail at derivation, calibration, and cross-source reconciliation.

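From Executable Software Worlds to Physically Structured World Models

5 minute read

Published: May 20, 2026

TL;DR: This round asks how reliable the “world” in agent training and evaluation really is. EnvFactory automatically synthesizes executable, verifiable tool environments for training tool-use agents; OpenComputer turns desktop software tasks into verifier-grounded software worlds that can read real application state; PH-Dreamer puts Port-Hamiltonian physical structure into a visual world model, so that latent imagination does not only chase reward but also stays closer to energy and motion constraints.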

Executable Worlds for Agents, Physical Priors for World Models

15 minute read

Published: May 20, 2026

TL;DR: this round is about making the world around an agent less imaginary. EnvFactory synthesizes executable tool environments for agentic RL instead of training only on over-specified traces. OpenComputer builds verifier-grounded desktop software worlds where computer-use agents can be scored by application state, not just screenshots or LLM judgment. PH-Dreamer puts a Port-Hamiltonian structure inside a visual world model, so latent imagination is pressured to respect energy and smoothness instead of only predicting reward.

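Explore Before Acting, Assemble Evidence Before Answering

4 minute read

Published: May 18, 2026

TL;DR: This round I look at what agents and world models should do first, before answering or acting. One paper makes “exploring an unfamiliar environment” a trainable capability rather than hoping task reward teaches it as a side effect; one paper changes deep research from voting across multiple search trajectories to assembling a shared evidence graph; one paper asks whether video latent prediction has actually learned representations closer to a world model, rather than looking only at clean classification accuracy.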

Exploration Before Action, Evidence Before Answers

15 minute read

Published: May 18, 2026

TL;DR: this round is about agents and world models that should not rush straight to an answer. One paper trains LLM agents to explore an unfamiliar environment before acting. One turns deep research into a shared evidence graph rather than a pile of independent search traces. One asks whether latent video prediction produces world-model representations that survive corruption, occlusion, contact ambiguity, and reversed time.

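From Path Choice to Routing Geometry

4 minute read

Published: May 13, 2026

TL;DR: This round I look not at “whether agents have intermediate state” but at how these intermediate structures change the decision itself. ToolCUA studies when a computer-use agent should keep clicking the GUI and when it should switch to a tool; DataMaster turns data engineering into a search with branches, memory, and feedback; PriorZero tries to plug language-model priors into world-model planning, but only at the MCTS root node; the MoE routing paper explains mechanistically why the router and experts learn a shared geometry, and how common load-balancing losses may damage that geometry.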

Path Choices, Data Search, and Routing Geometry

15 minute read

Published: May 13, 2026

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

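Reading Execution Traces Before Agents Fail

6 minute read

Published: May 11, 2026

TL;DR: This round looks at how agent trajectories become objects of training and diagnosis. A3 uses shell command structure to do step-by-step credit assignment for CLI agents, rather than looking only at whether the whole trajectory succeeds or fails. The Tool Calling paper finds that tool selection inside the model can be linearly read out and intervened on, and that wrong tool calls show a signal before the JSON output. MASPrism uses the NLL and attention from a small model's prefill stage to locate likely failure sources in long multi-agent logs, without needing to generate a long diagnostic text.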

Reading Agent Traces Before They Become Failures

17 minute read

Published: May 11, 2026

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

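Reading Hidden State Before Acting

5 minute read

Published: May 10, 2026

Short TL;DR: This round looks at what a model or agent has actually “seen” before it acts. Natural Language Autoencoders translate residual-stream activations into natural language, giving model auditing a readable entry point. BAMI diagnoses coordinate bias in GUI grounding and corrects some wrong clicks at test time. Sheet as Token compresses multi-sheet workbooks into retrievable sheet-level objects, so a data agent does not have to swallow the whole Excel file up front.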

Reading Hidden State Before Agents Act

17 minute read

Published: May 10, 2026

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

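Long-Horizon Agents That Plan, Delegate, and Check Citations

4 minute read

Published: May 09, 2026

TL;DR: This round looks at how long-horizon agents get away from a single reactive loop. StraTA has the agent generate a global strategy before execution and trains all subsequent actions under that strategy; RAO turns recursive subagents into a trainable inference-time scaling mechanism; the citation-attribution evaluation paper checks whether the citations in deep research reports really support the sentence next to them.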

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published: May 09, 2026

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

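Skills, Retrieval, and Memory-Augmented World Models

4 minute read

Published: May 08, 2026

TL;DR: This round looks at “actionable state” in agent workflows. SkillOS lets an agent learn from experience to maintain a skill library; SIRA compresses multi-turn search into a single corpus-aware lexical retrieval action; HaM-World uses selective memory and geometric structure to stabilize the world-model latents used for planning.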

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published: May 08, 2026

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

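Making Intermediate Artifacts Truly Usable by Agents

4 minute read

Published: May 06, 2026

TL;DR: This round asks whether “intermediate artifacts” can actually be consumed by downstream systems. TraceLift discusses whether a reasoning trace should be rewarded according to how much it helps a frozen executor; BRIGHT-Pro discusses whether a retriever should cover a set of complementary evidence rather than finding only one relevant passage; Agentic-imodels redefines an interpretable model as a tool that “an agent can read and simulate.”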

Intermediate Work That Agents Can Actually Use

13 minute read

Published: May 06, 2026

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

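From Search Trajectories to Memory Circuits to Theories of the World

5 minute read

Published: May 06, 2026

TL;DR: This round I want to look at the structure an agent has already learned before it gives an answer. OpenSeeker-v2 discusses how far high-quality, long, difficult search trajectories can push a pure-SFT search agent; the agent memory circuit paper separates the three memory stages of write, manage, and read for circuit tracing; Learning-to-Theorize pushes the world model from “predicting the next frame” toward “inducing executable theories from observations.”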

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published: May 06, 2026

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

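See the Evidence Clearly Before Letting the Model Answer

6 minute read

Published: May 05, 2026

TL;DR: This round does not fall back to April 30 but stays with new papers from May 3-4, on the theme of “what a model should look at before it answers or generates.” FlexSQL lets a data agent repeatedly check schemas, values, execution results, and plans during reasoning, instead of fixing schema retrieval once up front. Chart-FR1 trains dense chart reasoning as an explicit visual-focus process, binding each reasoning step to OCR text and local regions. PV-VAE changes the video VAE from pure reconstruction to predictive reconstruction, forcing the latent to carry information about motion and future change.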

Agents That Look Before They Answer

18 minute read

Published: May 05, 2026

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

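Moving Evidence Out of the Prompt

6 minute read

Published: May 03, 2026

TL;DR: New arXiv submissions on the target topics were scarce from May 1 to May 3, so after deduplication this round expands to the latest April 30 window. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation with dynamic demand signals and scores by observable actions; ObjectGraph turns documents into traversable typed graphs instead of injecting them wholesale into context; CIRM intervenes on reward-model activations at inference time, reducing the risk that formatting shortcuts become training labels.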

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published: May 03, 2026

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

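Adding Checks Before Drift Becomes the Answer

5 minute read

Published: May 03, 2026

TL;DR: This round I look at whether a system can detect or correct drift before it “looks finished.” There were not enough new arXiv CS submissions in the target areas from May 1 to May 3, so I expanded to the latest April 30 window and picked four open full-text papers: PRISM discusses pre-alignment of multimodal models before RLVR, PhyCo discusses physical-property control in video generation, FCMBench-Video discusses evaluation when document evidence unfolds over time, and Latent Adversarial Detection discusses the signal of multi-turn attack intent in activation trajectories.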

Checks Before Agents Drift

20 minute read

Published: May 03, 2026

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

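Giving Agents Structured Surfaces They Can Operate On

5 minute read

Published: May 02, 2026

TL;DR: This round I look not at “giving agents more context” or “swapping in a stronger model at every step,” but at what kind of structured operating surface an agent needs while it works. The four selected papers discuss, respectively: using a shared workboard for large-scale web-to-table data extraction, using event detection to decide when a computer-use agent escalates to a large model, using traces to analyze information contamination in multi-agent systems, and moving the target of reward-model mechanistic analysis from vocabulary projection to the reward-head direction.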

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published: May 02, 2026

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

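Better Structured Interfaces for Scientific Research Agents

4 minute read

Published: May 02, 2026

TL;DR: This round I look not at “agents got better at solving problems again” but at what interfaces scientific research agents actually need. The four papers turn method-evolution graphs, calls to domain foundation models, complex literature discovery, and scientific-visualization workflows into more explicit structures. My judgment: what research agents really lack is often not longer context but an intermediate layer that can be queried, called, verified, and replayed.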

Structured Interfaces for Scientific Agents

18 minute read

Published: May 02, 2026

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

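Making Data and Interface Agents Face Hard Evidence

3 minute read

Published: May 01, 2026

TL;DR: This round I look at “evidence objects getting harder.” None of the four papers is satisfied with letting an agent write a decent-looking answer; they go on to ask whether scientific data tasks can be checked by an executable verifier, whether table question answering can recognize implicit predictive intent, whether a GUI agent can reach an exact state, and whether an LLM-generated reward can be held back and deployed only at the right training stage.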

Hard Evidence for Data and Workflow Agents

16 minute read

Published: May 01, 2026

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

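Why Agents Need an Explicit Middle Layer

3 minute read

Published: May 01, 2026

TL;DR: This round I keep my eye on one thing. The really interesting work does not pile the end-to-end system a bit bigger; it makes the middle layer explicit: BEV tokens, latent CoT, concept manifolds, retrieval pages. Once a model has an inspectable intermediate representation, geometry, control, and evidence all become easier to align.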

Why Agents Need an Explicit Middle Layer

14 minute read

Published: May 01, 2026

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems; they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

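Giving Long-Horizon Agents a Replayable Workspace

4 minute read

Published: May 01, 2026

TL;DR: This round looks at a lower-level question: can the working process of a long-horizon agent be reproduced, restored, and audited? I picked four new papers from April 30 because each makes explicit a layer of long-horizon agents that is easy to overlook: Synthetic Computers constructs user-level workspaces, Crab saves the sandbox operating-system state, Exploration Hacking tests whether a model can resist reinforcement learning by controlling its exploration, and COHERENCE turns interleaved image-text document understanding into a verifiable alignment task.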

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published: May 01, 2026

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.

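From Closed-Loop Feedback to Auditable Workflows

3 minute read

Published: April 30, 2026

Short TL;DR: This round I shift attention from “can an agent clear a benchmark” to a harder question: once training and the workflow are over, can the process still be verified? The four papers offer four entry points: FutureWorld uses the outcomes of real-world events, once they resolve, as delayed rewards; AgentSim generates verifiable RAG agent trajectories; DV-World places data-visualization agents in native software, cross-framework evolution, and interactive clarification workflows; and MoRFI works at the mechanism level, asking why introducing new facts in post-training may damage the model's access to old knowledge.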

Closed Loops for Auditable Agents

17 minute read

Published: April 30, 2026

TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.

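Giving Long-Horizon Agents Verifiable State

4 minute read

Published: April 30, 2026

Short TL;DR: This round looks at a very practical change: a stronger agent is not just a longer-context model, but a system that can generate tasks, verify intermediate state, preserve evidence, and learn from changes in world state. I picked four recent open full-text papers: ClawGym builds executable, verifiable computer-use tasks; World2VLM turns world-model imagination into a training signal; DataPRM lets a data-analysis reward model enter the environment to check steps; and OCR-Memory stores long-horizon history as optically retrievable evidence.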

Verifiable State for Long-Horizon Agents

16 minute read

Published: April 30, 2026

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.

Posts by Tags

AI

DID

agents

econometrics

measurement error

paper-digest

paper-radar