Posts by Tags

AI

From Path Choices to Routing Geometry

4 minute read

Published:

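TL;DR: my focus this round is not “whether agents have intermediate state”, but how those intermediate structures change the decisions themselves. ToolCUA studies when a computer-use agent should keep clicking through the GUI and when it should switch to a tool; DataMaster turns data engineering into a search with branching, memory, and feedback; PriorZero tries to plug language-model priors into world-model planning, but only at the MCTS root node; the MoE routing paper gives a mechanistic account of why routers and experts learn a shared geometry, and of how common load-balancing losses may damage it.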

Path Choices, Data Search, and Routing Geometry

15 minute read

Published:

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

Reading Execution Traces Before Agents Fail

6 minute read

Published:

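TL;DR: this round looks at how agent traces become objects for training and diagnosis. A3 uses the structure of shell commands to do step-level credit assignment for CLI agents, rather than looking only at whether the whole trajectory succeeded. The Tool Calling paper finds that a model's internal tool choice can be linearly read out and intervened on, and that wrong tool calls show a signal before the JSON is emitted. MASPrism uses NLL and attention from a small model's prefill stage to locate likely failure sources in long multi-agent logs, without generating a long diagnostic write-up.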

Reading Agent Traces Before They Become Failures

17 minute read

Published:

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

Reading Hidden State Before Acting

5 minute read

Published:

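Short TL;DR: this round is about what a model or agent actually “sees” before it acts. Natural Language Autoencoders translate residual-stream activations into natural language, giving model audits a readable entry point. BAMI diagnoses coordinate bias in GUI grounding and corrects some wrong clicks at test time. Sheet as Token compresses multi-sheet workbooks into retrievable sheet-level objects, so a data agent does not have to swallow the whole Excel file up front.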

Reading Hidden State Before Agents Act

17 minute read

Published:

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

Teaching Long-Horizon Agents to Plan, Delegate, and Check Citations

4 minute read

Published:

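TL;DR: this round looks at how long-horizon agents can get past a single reactive loop. StraTA has the agent generate a global strategy before execution and trains all subsequent actions under that strategy; RAO turns recursive subagents into a trainable inference-time scaling mechanism; the citation-attribution evaluation paper checks whether the citations in deep-research reports actually support the sentence next to them.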

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published:

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

Skills, Retrieval, and Memory-Augmented World Models

4 minute read

Published:

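TL;DR: this round is about “actionable state” in agent workflows. SkillOS lets an agent learn from experience how to maintain a skill library; SIRA compresses multi-round search into a single corpus-aware lexical retrieval action; HaM-World uses selective memory and geometric structure to stabilize the world-model latents used for planning.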

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published:

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

Making Intermediate Artifacts Truly Usable by Agents

4 minute read

Published:

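TL;DR: this round asks whether “intermediate artifacts” can actually be consumed by downstream systems. TraceLift discusses whether a reasoning trace should be rewarded according to how much it helps a frozen executor; BRIGHT-Pro discusses whether a retriever should cover a set of complementary evidence rather than find just one relevant passage; Agentic-imodels redefines interpretable models as tools that an agent can read and simulate.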

Intermediate Work That Agents Can Actually Use

13 minute read

Published:

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

From Search Trajectories to Memory Circuits to Theories of the World

5 minute read

Published:

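TL;DR: this round I want to look at the structure an agent has already learned before it gives an answer. OpenSeeker-v2 discusses how far high-quality, long and difficult search trajectories can push a pure-SFT search agent; the agent memory circuit paper separates the write, manage, and read stages of memory for circuit tracing; Learning-to-Theorize pushes world models from “predicting the next frame” toward “inducing executable theories from observations”.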

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published:

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

Look at the Evidence First, Then Let the Model Answer

6 minute read

Published:

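TL;DR: this round did not fall back to April 30 and instead kept to the new papers from May 3-4; the theme is “what a model should look at before it answers or generates”. FlexSQL lets a data agent repeatedly inspect schemas, values, execution results, and plans during reasoning, rather than fixing schema retrieval once up front. Chart-FR1 trains dense-chart reasoning as an explicit visual-focus process, binding each reasoning step to OCR text and local regions. PV-VAE changes the video VAE from pure reconstruction to predictive reconstruction, forcing the latent to carry information about motion and future change.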

Agents That Look Before They Answer

18 minute read

Published:

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

Moving Evidence Out of the Prompt

6 minute read

Published:

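TL;DR: there were very few new arXiv submissions on the target topics from May 1 to 3, so after deduplication this round expands to the latest April 30 window. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation with dynamic demand signals and scores observable actions; ObjectGraph turns documents into traversable typed graphs instead of injecting them wholesale into the context; CIRM intervenes on reward-model activations at inference time, lowering the risk that format shortcuts become training labels.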

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published:

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

Adding Checks Before Drift Becomes the Answer

5 minute read

Published:

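TL;DR: this round I focus on whether a system can detect or correct drift before it “looks finished”. There were not enough new arXiv CS submissions in the target areas from May 1 to 3, so I expanded to the latest April 30 window and selected four open-access papers: PRISM on pre-aligning multimodal models before RLVR, PhyCo on controlling physical properties in video generation, FCMBench-Video on evaluation when document evidence unfolds over time, and Latent Adversarial Detection on signals of multi-turn attack intent in activation trajectories.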

Checks Before Agents Drift

20 minute read

Published:

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

Giving Agents Structured Surfaces to Operate On

5 minute read

Published:

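TL;DR: my focus this round is not “give the agent more context” or “swap in a stronger model at every step”, but what kind of structured operating surface an agent needs while it works. The four selected papers cover: large-scale web-to-table data extraction with a shared workboard, event detection to decide when a computer-use agent should escalate to a large model, trace-based analysis of information contamination in multi-agent systems, and moving the target of mechanistic analysis for reward models from vocabulary projection to the reward-head direction.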

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published:

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

Better Structured Interfaces for Scientific Research Agents

4 minute read

Published:

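TL;DR: what I look at this round is not “agents got better at solving problems again”, but what interfaces scientific research agents actually need. The four papers give more explicit structure to method-evolution graphs, calls to domain foundation models, complex literature discovery, and scientific-visualization workflows. My take: what research agents really lack is often not longer context, but an intermediate layer that can be queried, called, verified, and replayed.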

Structured Interfaces for Scientific Agents

18 minute read

Published:

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

Making Data and GUI Agents Face Hard Evidence

3 minute read

Published:

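TL;DR: this round is about “evidence objects getting harder”. None of the four papers is satisfied with having an agent write a decent-looking answer; they go on to ask whether scientific data tasks can be checked by an executable verifier, whether table question answering can recognize implicit predictive intent, whether a GUI agent can reach an exact state, and whether an LLM-generated reward can wait to be deployed until the right training stage.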

Hard Evidence for Data and Workflow Agents

16 minute read

Published:

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

Why Agents Need an Explicit Intermediate Layer

3 minute read

Published:

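TL;DR: this round I keep my eye on one thing. The really interesting work does not pile the end-to-end system a bit higher; it makes the middle layer explicit: BEV tokens, latent CoT, concept manifolds, retrieval pages. Once a model has an inspectable intermediate representation, geometry, control, and evidence all become easier to align.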

Why Agents Need an Explicit Middle Layer

14 minute read

Published:

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems; they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

Giving Long-Horizon Agents Replayable Workspaces

4 minute read

Published:

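TL;DR: this round focuses on a more foundational question: can the working process of a long-horizon agent be reproduced, restored, and audited? I picked four new papers from April 30 because each makes explicit a layer of long-horizon agents that is easy to overlook: Synthetic Computers builds user-level workspaces, Crab saves the operating-system state of sandboxes, Exploration Hacking tests whether a model can resist reinforcement learning by controlling its exploration, and COHERENCE turns interleaved image-text document understanding into a verifiable alignment task.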

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published:

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.

From Closed-Loop Feedback to Auditable Workflows

3 minute read

Published:

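Short TL;DR: this round I shift attention from “can an agent clear a benchmark” to a harder question: after training and the workflow have finished, can the process still be verified? The four papers offer four entry points: FutureWorld uses the outcomes of real-world events, once they resolve, as delayed rewards; AgentSim generates verifiable RAG agent trajectories; DV-World places data-visualization agents in native software, cross-framework evolution, and interactive clarification workflows; MoRFI looks at the mechanistic level at why introducing new facts in post-training can damage the model's access to old knowledge.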

Closed Loops for Auditable Agents

17 minute read

Published:

TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.

Giving Long-Horizon Agents Verifiable State

4 minute read

Published:

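Short TL;DR: this round focuses on a very practical shift: a stronger agent is not just a longer-context model, but a system that can generate tasks, verify intermediate state, preserve evidence, and learn from changes in world state. I picked four recent open-access papers: ClawGym builds executable, verifiable computer-use tasks; World2VLM turns world-model imagination into a training signal; DataPRM brings the data-analysis reward model into the environment to check steps; OCR-Memory stores long-horizon history as optically retrievable evidence.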

Verifiable State for Long-Horizon Agents

16 minute read

Published:

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.

DID

How to Ensure the Consistency of Estimators When Dealing with Systematic Errors in Continuous Variable Systems

7 minute read

Published:

TL;DR: Not all measurement error is fatal in panel data. In a Two-Way Fixed Effects (TWFE) model:

  • Constant and Time-Specific additive errors are harmless (absorbed by FEs).
  • Unit-Specific additive errors create bias when interacting a continuous regressor with a post-treatment dummy (Continuous DiD).
  • The Fix: You can eliminate this bias by including Unit × Post fixed effects or by using post-period centering (FWL residualization).
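
The three bullets above can be checked with a small simulation. What follows is a minimal sketch, not the post's own code: it assumes a balanced panel, a true model y_it = alpha_i + lambda_t + beta * x_it * Post_t + noise, and error terms that are independent of x. The function names (`twfe_slope`, `simulate`), the sample sizes, and the error variances are all illustrative choices. Post-period centering is implemented here as subtracting each unit's post-period mean of the mismeasured regressor before interacting with Post, which is my reading of the FWL argument.

```python
import numpy as np


def twfe_slope(y, z):
    """OLS slope of y on z after absorbing unit and time fixed effects.

    y and z have shape (units, periods). The panel is assumed balanced, so
    two-way demeaning is equivalent to including both sets of dummies.
    """
    def demean(a):
        return (a - a.mean(axis=1, keepdims=True)
                  - a.mean(axis=0, keepdims=True) + a.mean())

    yd, zd = demean(y), demean(z)
    return float((zd * yd).sum() / (zd * zd).sum())


def simulate(n_units=2000, n_periods=6, beta=2.0, seed=0):
    """Compare TWFE estimates of beta under different additive errors in x."""
    rng = np.random.default_rng(seed)
    # Post dummy: 1 in the second half of the sample period.
    post = (np.arange(n_periods) >= n_periods // 2).astype(float)

    x = rng.normal(size=(n_units, n_periods))   # true continuous regressor
    alpha = rng.normal(size=(n_units, 1))       # unit fixed effects
    lam = rng.normal(size=(1, n_periods))       # time fixed effects
    y = alpha + lam + beta * x * post + rng.normal(size=x.shape)

    v_t = rng.normal(size=(1, n_periods))       # time-specific error
    u_i = rng.normal(size=(n_units, 1))         # unit-specific error

    x_time = x + 0.5 + v_t                      # constant + time-specific error
    x_unit = x + u_i                            # unit-specific error

    # Post-period centering: remove each unit's post-period mean, which
    # removes u_i from the interaction term.
    post_mean = x_unit[:, post == 1].mean(axis=1, keepdims=True)
    x_centered = x_unit - post_mean

    return {
        "clean": twfe_slope(y, x * post),
        "time_error": twfe_slope(y, x_time * post),
        "unit_error": twfe_slope(y, x_unit * post),
        "unit_error_centered": twfe_slope(y, x_centered * post),
    }


if __name__ == "__main__":
    for name, estimate in simulate().items():
        print(f"{name:>20s}: {estimate:.3f}")
```

Under these assumptions the constant and time-specific errors add a term that depends only on t, so the time fixed effects absorb it and the estimate matches the error-free one. The unit-specific error adds u_i × Post_t, which neither set of fixed effects absorbs, and the slope is attenuated. Centering recovers beta, but only because x varies within a unit across post periods; with a time-invariant dose the centered regressor would be identically zero.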

agents

From Path Choices to Routing Geometry

4 minute read

Published:

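TL;DR: my focus this round is not “whether agents have intermediate state”, but how those intermediate structures change the decisions themselves. ToolCUA studies when a computer-use agent should keep clicking through the GUI and when it should switch to a tool; DataMaster turns data engineering into a search with branching, memory, and feedback; PriorZero tries to plug language-model priors into world-model planning, but only at the MCTS root node; the MoE routing paper gives a mechanistic account of why routers and experts learn a shared geometry, and of how common load-balancing losses may damage it.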

Path Choices, Data Search, and Routing Geometry

15 minute read

Published:

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

Reading Execution Traces Before Agents Fail

6 minute read

Published:

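TL;DR: this round looks at how agent traces become objects for training and diagnosis. A3 uses the structure of shell commands to do step-level credit assignment for CLI agents, rather than looking only at whether the whole trajectory succeeded. The Tool Calling paper finds that a model's internal tool choice can be linearly read out and intervened on, and that wrong tool calls show a signal before the JSON is emitted. MASPrism uses NLL and attention from a small model's prefill stage to locate likely failure sources in long multi-agent logs, without generating a long diagnostic write-up.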

Reading Agent Traces Before They Become Failures

17 minute read

Published:

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

Reading Hidden State Before Acting

5 minute read

Published:

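Short TL;DR: this round is about what a model or agent actually “sees” before it acts. Natural Language Autoencoders translate residual-stream activations into natural language, giving model audits a readable entry point. BAMI diagnoses coordinate bias in GUI grounding and corrects some wrong clicks at test time. Sheet as Token compresses multi-sheet workbooks into retrievable sheet-level objects, so a data agent does not have to swallow the whole Excel file up front.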

Reading Hidden State Before Agents Act

17 minute read

Published:

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

Teaching Long-Horizon Agents to Plan, Delegate, and Check Citations

4 minute read

Published:

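TL;DR: this round looks at how long-horizon agents can get past a single reactive loop. StraTA has the agent generate a global strategy before execution and trains all subsequent actions under that strategy; RAO turns recursive subagents into a trainable inference-time scaling mechanism; the citation-attribution evaluation paper checks whether the citations in deep-research reports actually support the sentence next to them.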

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published:

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

Skills, Retrieval, and Memory-Augmented World Models

4 minute read

Published:

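TL;DR: this round is about “actionable state” in agent workflows. SkillOS lets an agent learn from experience how to maintain a skill library; SIRA compresses multi-round search into a single corpus-aware lexical retrieval action; HaM-World uses selective memory and geometric structure to stabilize the world-model latents used for planning.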

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published:

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

Making Intermediate Artifacts Truly Usable by Agents

4 minute read

Published:

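TL;DR: this round asks whether “intermediate artifacts” can actually be consumed by downstream systems. TraceLift discusses whether a reasoning trace should be rewarded according to how much it helps a frozen executor; BRIGHT-Pro discusses whether a retriever should cover a set of complementary evidence rather than find just one relevant passage; Agentic-imodels redefines interpretable models as tools that an agent can read and simulate.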

Intermediate Work That Agents Can Actually Use

13 minute read

Published:

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

From Search Trajectories to Memory Circuits to Theories of the World

5 minute read

Published:

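TL;DR: this round I want to look at the structure an agent has already learned before it gives an answer. OpenSeeker-v2 discusses how far high-quality, long and difficult search trajectories can push a pure-SFT search agent; the agent memory circuit paper separates the write, manage, and read stages of memory for circuit tracing; Learning-to-Theorize pushes world models from “predicting the next frame” toward “inducing executable theories from observations”.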

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published:

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

Look at the Evidence First, Then Let the Model Answer

6 minute read

Published:

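TL;DR: this round did not fall back to April 30 and instead kept to the new papers from May 3-4; the theme is “what a model should look at before it answers or generates”. FlexSQL lets a data agent repeatedly inspect schemas, values, execution results, and plans during reasoning, rather than fixing schema retrieval once up front. Chart-FR1 trains dense-chart reasoning as an explicit visual-focus process, binding each reasoning step to OCR text and local regions. PV-VAE changes the video VAE from pure reconstruction to predictive reconstruction, forcing the latent to carry information about motion and future change.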

Agents That Look Before They Answer

18 minute read

Published:

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

Moving Evidence Out of the Prompt

6 minute read

Published:

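TL;DR: there were very few new arXiv submissions on the target topics from May 1 to 3, so after deduplication this round expands to the latest April 30 window. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation with dynamic demand signals and scores observable actions; ObjectGraph turns documents into traversable typed graphs instead of injecting them wholesale into the context; CIRM intervenes on reward-model activations at inference time, lowering the risk that format shortcuts become training labels.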

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published:

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

Adding Checks Before Drift Becomes the Answer

5 minute read

Published:

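TL;DR: this round I focus on whether a system can detect or correct drift before it “looks finished”. There were not enough new arXiv CS submissions in the target areas from May 1 to 3, so I expanded to the latest April 30 window and selected four open-access papers: PRISM on pre-aligning multimodal models before RLVR, PhyCo on controlling physical properties in video generation, FCMBench-Video on evaluation when document evidence unfolds over time, and Latent Adversarial Detection on signals of multi-turn attack intent in activation trajectories.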

Checks Before Agents Drift

20 minute read

Published:

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

Giving Agents Structured Surfaces to Operate On

5 minute read

Published:

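TL;DR: my focus this round is not “give the agent more context” or “swap in a stronger model at every step”, but what kind of structured operating surface an agent needs while it works. The four selected papers cover: large-scale web-to-table data extraction with a shared workboard, event detection to decide when a computer-use agent should escalate to a large model, trace-based analysis of information contamination in multi-agent systems, and moving the target of mechanistic analysis for reward models from vocabulary projection to the reward-head direction.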

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published:

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

Better Structured Interfaces for Scientific Research Agents

4 minute read

Published:

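TL;DR: what I look at this round is not “agents got better at solving problems again”, but what interfaces scientific research agents actually need. The four papers give more explicit structure to method-evolution graphs, calls to domain foundation models, complex literature discovery, and scientific-visualization workflows. My take: what research agents really lack is often not longer context, but an intermediate layer that can be queried, called, verified, and replayed.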

Structured Interfaces for Scientific Agents

18 minute read

Published:

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

Making Data and GUI Agents Face Hard Evidence

3 minute read

Published:

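TL;DR: this round is about “evidence objects getting harder”. None of the four papers is satisfied with having an agent write a decent-looking answer; they go on to ask whether scientific data tasks can be checked by an executable verifier, whether table question answering can recognize implicit predictive intent, whether a GUI agent can reach an exact state, and whether an LLM-generated reward can wait to be deployed until the right training stage.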

Hard Evidence for Data and Workflow Agents

16 minute read

Published:

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

Why Agents Need an Explicit Intermediate Layer

3 minute read

Published:

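TL;DR: this round I keep my eye on one thing. The really interesting work does not pile the end-to-end system a bit higher; it makes the middle layer explicit: BEV tokens, latent CoT, concept manifolds, retrieval pages. Once a model has an inspectable intermediate representation, geometry, control, and evidence all become easier to align.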

Why Agents Need an Explicit Middle Layer

14 minute read

Published:

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems; they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

Giving Long-Horizon Agents Replayable Workspaces

4 minute read

Published:

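TL;DR: this round focuses on a more foundational question: can the working process of a long-horizon agent be reproduced, restored, and audited? I picked four new papers from April 30 because each makes explicit a layer of long-horizon agents that is easy to overlook: Synthetic Computers builds user-level workspaces, Crab saves the operating-system state of sandboxes, Exploration Hacking tests whether a model can resist reinforcement learning by controlling its exploration, and COHERENCE turns interleaved image-text document understanding into a verifiable alignment task.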

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published:

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.

From Closed-Loop Feedback to Auditable Workflows

3 minute read

Published:

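Short TL;DR: this round I shift attention from “can an agent clear a benchmark” to a harder question: after training and the workflow have finished, can the process still be verified? The four papers offer four entry points: FutureWorld uses the outcomes of real-world events, once they resolve, as delayed rewards; AgentSim generates verifiable RAG agent trajectories; DV-World places data-visualization agents in native software, cross-framework evolution, and interactive clarification workflows; MoRFI looks at the mechanistic level at why introducing new facts in post-training can damage the model's access to old knowledge.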

Closed Loops for Auditable Agents

17 minute read

Published:

TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.

Giving Long-Horizon Agents Verifiable State

4 minute read

Published:

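Short TL;DR: this round focuses on a very practical shift: a stronger agent is not just a longer-context model, but a system that can generate tasks, verify intermediate state, preserve evidence, and learn from changes in world state. I picked four recent open-access papers: ClawGym builds executable, verifiable computer-use tasks; World2VLM turns world-model imagination into a training signal; DataPRM brings the data-analysis reward model into the environment to check steps; OCR-Memory stores long-horizon history as optically retrievable evidence.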

Verifiable State for Long-Horizon Agents

16 minute read

Published:

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.

econometrics

How to Ensure the Consistency of Estimators When Dealing with Systematic Errors in Continuous Variable Systems

7 minute read

Published:

TL;DR: Not all measurement error is fatal in panel data. In a Two-Way Fixed Effects (TWFE) model:

  • Constant and Time-Specific additive errors are harmless (absorbed by FEs).
  • Unit-Specific additive errors create bias when interacting a continuous regressor with a post-treatment dummy (Continuous DiD).
  • The Fix: You can eliminate this bias by including Unit × Post fixed effects or by using post-period centering (FWL residualization).

measurement error

How to Ensure the Consistency of Estimators When Dealing with Systematic Errors in Continuous Variable Systems

7 minute read

Published:

TL;DR: Not all measurement error is fatal in panel data. In a Two-Way Fixed Effects (TWFE) model:

  • Constant and Time-Specific additive errors are harmless (absorbed by FEs).
  • Unit-Specific additive errors create bias when interacting a continuous regressor with a post-treatment dummy (Continuous DiD).
  • The Fix: You can eliminate this bias by including Unit × Post fixed effects or by using post-period centering (FWL residualization).

paper-digest

From Path Choices to Routing Geometry

4 minute read

Published:

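TL;DR: my focus this round is not “whether agents have intermediate state”, but how those intermediate structures change the decisions themselves. ToolCUA studies when a computer-use agent should keep clicking through the GUI and when it should switch to a tool; DataMaster turns data engineering into a search with branching, memory, and feedback; PriorZero tries to plug language-model priors into world-model planning, but only at the MCTS root node; the MoE routing paper gives a mechanistic account of why routers and experts learn a shared geometry, and of how common load-balancing losses may damage it.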

Path Choices, Data Search, and Routing Geometry

15 minute read

Published:

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

Reading Execution Traces Before Agents Fail

6 minute read

Published:

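TL;DR: this round looks at how agent traces become objects for training and diagnosis. A3 uses the structure of shell commands to do step-level credit assignment for CLI agents, rather than looking only at whether the whole trajectory succeeded. The Tool Calling paper finds that a model's internal tool choice can be linearly read out and intervened on, and that wrong tool calls show a signal before the JSON is emitted. MASPrism uses NLL and attention from a small model's prefill stage to locate likely failure sources in long multi-agent logs, without generating a long diagnostic write-up.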

Reading Agent Traces Before They Become Failures

17 minute read

Published:

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

Reading Hidden State Before Acting

5 minute read

Published:

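Short TL;DR: this round is about what a model or agent actually “sees” before it acts. Natural Language Autoencoders translate residual-stream activations into natural language, giving model audits a readable entry point. BAMI diagnoses coordinate bias in GUI grounding and corrects some wrong clicks at test time. Sheet as Token compresses multi-sheet workbooks into retrievable sheet-level objects, so a data agent does not have to swallow the whole Excel file up front.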

Reading Hidden State Before Agents Act

17 minute read

Published:

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

Teaching Long-Horizon Agents to Plan, Delegate, and Check Citations

4 minute read

Published:

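TL;DR: this round looks at how long-horizon agents can get past a single reactive loop. StraTA has the agent generate a global strategy before execution and trains all subsequent actions under that strategy; RAO turns recursive subagents into a trainable inference-time scaling mechanism; the citation-attribution evaluation paper checks whether the citations in deep-research reports actually support the sentence next to them.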

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published:

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

Skills, Retrieval, and Memory-Augmented World Models

4 minute read

Published:

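TL;DR: this round is about “actionable state” in agent workflows. SkillOS lets an agent learn from experience how to maintain a skill library; SIRA compresses multi-round search into a single corpus-aware lexical retrieval action; HaM-World uses selective memory and geometric structure to stabilize the world-model latents used for planning.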

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published:

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

Making Intermediate Artifacts Truly Usable by Agents

4 minute read

Published:

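TL;DR: this round asks whether “intermediate artifacts” can actually be consumed by downstream systems. TraceLift discusses whether a reasoning trace should be rewarded according to how much it helps a frozen executor; BRIGHT-Pro discusses whether a retriever should cover a set of complementary evidence rather than find just one relevant passage; Agentic-imodels redefines interpretable models as tools that an agent can read and simulate.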

Intermediate Work That Agents Can Actually Use

13 minute read

Published:

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

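From Search Trajectories to Memory Circuits to Theories of the World

5 minute read

Published:

TL;DR: this round I want to look at the structure an agent has already learned before it gives an answer. OpenSeeker-v2 discusses how far high-quality, long, and difficult search trajectories can push a pure-SFT search agent; the agent memory circuit paper separates the three memory stages of write, manage, and read for circuit tracing; Learning-to-Theorize pushes world models from “predicting the next frame” toward “inducing executable theories from observations.”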

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published:

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

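See the Evidence Clearly Before Letting the Model Answer

6 minute read

Published:

TL;DR: this round did not fall back to April 30 but kept the new May 3-4 papers; the theme is “what a model should look at before it answers or generates.” FlexSQL lets a data agent repeatedly inspect schemas, values, execution results, and plans during reasoning instead of fixing schema retrieval once up front. Chart-FR1 trains dense-chart reasoning as an explicit visual-focus process, binding each reasoning step to OCR text and local regions. PV-VAE changes the video VAE from pure reconstruction to predictive reconstruction, forcing the latent to carry information about motion and future change.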

Agents That Look Before They Answer

18 minute read

Published:

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

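Moving Evidence Out of the Prompt

6 minute read

Published:

TL;DR: there were few new arXiv submissions on the target topics from May 1 to May 3, so after deduplication this round expands to the latest April 30 window. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation with dynamic demand signals and scores observable actions; ObjectGraph turns documents into traversable typed graphs instead of injecting them wholesale into context; CIRM intervenes on reward-model activations at inference time, reducing the risk that format shortcuts become training labels.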

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published:

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

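Adding Checks Before Drift Becomes an Answer

5 minute read

Published:

TL;DR: this round I focus on whether a system can detect or correct drift before it “looks finished.” There were not enough new arXiv CS submissions in the target areas from May 1 to May 3, so I expanded to the latest April 30 window and selected four open-access papers: PRISM discusses pre-alignment of multimodal models before RLVR, PhyCo discusses control of physical attributes in video generation, FCMBench-Video discusses evaluation when document evidence unfolds over time, and Latent Adversarial Detection discusses the signal of multi-turn attack intent in activation trajectories.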

Checks Before Agents Drift

20 minute read

Published:

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

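Giving Agents Structured Surfaces to Operate On

5 minute read

Published:

TL;DR: this round my focus is not “give the agent more context” or “swap in a stronger model at every step,” but what kind of structured operating surface an agent needs while it works. The four selected papers discuss: using a shared workboard for large-scale web-to-table data extraction, using event detection to decide when a computer-use agent should escalate to a large model, using traces to analyze information contamination in multi-agent systems, and moving the target of mechanistic analysis for reward models from vocabulary projection to the reward-head direction.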

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published:

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

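Better Structured Interfaces for Scientific Research Agents

4 minute read

Published:

TL;DR: what I look at this round is not “agents got better at solving problems again,” but what interfaces scientific research agents actually need. The four papers turn method-evolution graphs, domain foundation-model calls, complex literature discovery, and scientific-visualization workflows into more explicit structures. My judgment: what research agents really lack is often not longer context but a middle layer that can be queried, called, verified, and replayed.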

Structured Interfaces for Scientific Agents

18 minute read

Published:

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

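Making Data and Interface Agents Face Hard Evidence

3 minute read

Published:

TL;DR: this round is about “evidence objects getting harder.” None of the four papers is satisfied with having an agent write a plausible answer; they go on to ask whether scientific data tasks can be checked by an executable verifier, whether table question answering can recognize implicit predictive intent, whether a GUI agent can reach an exact state, and whether an LLM-generated reward can be held back and deployed at the appropriate training stage.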

Hard Evidence for Data and Workflow Agents

16 minute read

Published:

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

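Why Agents Need an Explicit Middle Layer

3 minute read

Published:

TL;DR: this round I kept my eye on one thing. The truly interesting work does not pile an end-to-end system a bit higher; it makes the middle layer explicit: BEV tokens, latent CoT, concept manifolds, retrieval pages. Once a model has an inspectable intermediate representation, geometry, control, and evidence all become easier to align.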

Why Agents Need an Explicit Middle Layer

14 minute read

Published:

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems; they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

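Giving Long-Horizon Agents Replayable Workspaces

4 minute read

Published:

TL;DR: this round focuses on a lower-level question: can the working process of a long-horizon agent be reproduced, recovered, and audited? I selected four new April 30 papers because each makes explicit a layer of long-horizon agency that is easy to overlook: Synthetic Computers constructs user-level workspaces, Crab saves sandbox operating-system state, Exploration Hacking tests whether a model can resist reinforcement learning by controlling its exploration, and COHERENCE turns interleaved image-text document understanding into a verifiable alignment task.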

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published:

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.

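From Closed-Loop Feedback to Auditable Workflows

3 minute read

Published:

Short TL;DR: this round I shift attention from “can an agent clear a benchmark?” to a harder question: after training and the workflow have finished, can the process still be verified? The four papers offer four entry points: FutureWorld uses the outcomes of real-world events, once they resolve, as delayed rewards; AgentSim generates verifiable RAG agent trajectories; DV-World places data-visualization agents in native software, cross-framework evolution, and interactive clarification workflows; MoRFI looks at the mechanistic level at why introducing new facts during post-training can damage a model's access to old knowledge.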

Closed Loops for Auditable Agents

17 minute read

Published:

TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.

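Giving Long-Horizon Agents Verifiable State

4 minute read

Published:

Short TL;DR: this round focuses on a very practical shift: stronger agents are not just longer-context models but systems that can generate tasks, verify intermediate state, preserve evidence, and learn from changes in world state. I selected four recent open-access papers: ClawGym builds executable, verifiable computer-use tasks; World2VLM turns world-model imagination into a training signal; DataPRM brings a data-analysis reward model into the environment to check steps; OCR-Memory stores long-horizon history as optically retrievable evidence.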

Verifiable State for Long-Horizon Agents

16 minute read

Published:

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.

research

从路径选择到路由几何

4 minute read

Published:

TL;DR:本期我关注的不是“智能体有没有中间状态”,而是这些中间结构怎样改变决策本身。ToolCUA 研究电脑使用智能体什么时候该继续点 GUI、什么时候该切到工具;DataMaster 把数据工程变成带分支、记忆和反馈的搜索;PriorZero 试图把语言模型先验接入世界模型规划,但只放在 MCTS 根节点;MoE 路由论文则从机理上解释 router 和 expert 为什么会学出共享几何,以及常见 load-balancing loss 可能怎样破坏这种几何。

Path Choices, Data Search, and Routing Geometry

15 minute read

Published:

TL;DR: this round is about decision surfaces inside systems that look agentic from the outside. ToolCUA asks when a computer-use agent should stay with GUI actions and when it should call a tool. DataMaster treats data engineering as a branching search problem rather than a one-shot preprocessing step. PriorZero tries to use language priors without letting them corrupt a learned world model. The MoE routing paper is more mechanistic: it argues that routers and experts learn a shared geometry, and that common load-balancing losses can damage it.

在智能体失败之前读懂执行轨迹

6 minute read

Published:

TL;DR:本期看的是智能体轨迹如何成为训练和诊断对象。A3 用 shell 命令结构给 CLI 智能体做逐步 credit assignment,而不是只看整条轨迹成败。Tool Calling 论文发现,模型内部的工具选择可以线性读出和干预,错误工具调用在 JSON 输出前就有信号。MASPrism 则用小模型 prefill 阶段的 NLL 和 attention,在长多智能体日志里定位可能的失败源头,而且不需要生成诊断长文。

Reading Agent Traces Before They Become Failures

17 minute read

Published:

TL;DR: this round is about agent traces as training and diagnostic objects. A3 trains CLI agents by assigning credit to shell actions rather than only to whole trajectories. The tool-calling paper shows that tool choice is linearly readable and steerable inside language models before the JSON call is emitted. MASPrism uses small-model prefill signals to locate likely failure sources in long multi-agent logs without decoding a diagnostic answer.

在行动之前读懂隐藏状态

5 minute read

Published:

简短 TL;DR:本期关注的是模型或智能体在行动之前到底“看见”了什么。Natural Language Autoencoders 把残差流激活翻译成自然语言,给模型审计提供一个可读入口。BAMI 诊断 GUI grounding 的坐标偏差,并在测试时纠正一部分错误点击。Sheet as Token 把多表格工作簿压缩成可检索的 sheet 级对象,让数据智能体不必一上来就吞整本 Excel。

Reading Hidden State Before Agents Act

17 minute read

Published:

TL;DR: this round is about making hidden state readable before it becomes a wrong action. Natural Language Autoencoders translate residual-stream activations into text for model auditing. BAMI diagnoses GUI grounding failures and fixes some of them at test time. Sheet as Token turns messy multi-sheet workbooks into retrievable sheet-level objects for data agents.

让长程智能体会规划、会分工、也会验引文

4 minute read

Published:

TL;DR:本期看长程智能体如何摆脱单一 reactive loop。StraTA 让 agent 在执行前先生成全局策略,并把后续动作都放在这个策略下训练;RAO 把递归 subagent 变成可训练的推理时扩展机制;引文归因评测论文则检查 deep research 报告里的引用是否真的支撑旁边那句话。

Strategies, Subagents, and Citation Checks for Long-Horizon Work

16 minute read

Published:

TL;DR: this round is about long-horizon work that does not fit inside a single reactive loop. StraTA trains an agent to carry a global strategy through an episode. RAO trains a model to use recursive subagents rather than treating delegation as a fixed scaffold. The source-attribution paper asks whether deep-research citations actually support the claims they are attached to.

技能、检索与记忆化世界模型

4 minute read

Published:

TL;DR:本期看的是智能体工作流里的“可操作状态”。SkillOS 让智能体从经验里学会维护技能库;SIRA 把多轮搜索压缩成一次有语料意识的词法检索动作;HaM-World 用选择性记忆和几何结构稳定规划用的世界模型潜变量。

Skills, Retrieval, and Memory for Agent Workflows

13 minute read

Published:

TL;DR: this round is about agent-facing state. SkillOS learns how to curate reusable skills from experience. SIRA compresses retrieval into one corpus-discriminative lexical action. HaM-World gives planning a structured latent with memory and geometry so rollouts do not fall apart as quickly.

让中间产物能被智能体真正使用

4 minute read

Published:

TL;DR:本期看的是“中间产物”能不能被下游系统真正消费。TraceLift 讨论推理轨迹是否应该按它对冻结执行器的帮助来奖励;BRIGHT-Pro 讨论检索器是否应该覆盖一组互补证据,而不是只找一个相关段落;Agentic-imodels 则把可解释模型重新定义为“智能体读得懂、能模拟”的工具。

Intermediate Work That Agents Can Actually Use

13 minute read

Published:

TL;DR: this round is about intermediate work that another system has to consume. TraceLift asks whether a reasoning plan should be rewarded only when it helps a frozen executor. BRIGHT-Pro asks whether retrieval should cover an evidence portfolio rather than one relevant passage. Agentic-imodels asks whether an interpretable model should be readable by an agent, not only by a human analyst.

从搜索轨迹到记忆电路,再到世界理论

5 minute read

Published:

TL;DR:本期我想看的是智能体给出答案之前已经学到的结构。OpenSeeker-v2 讨论高质量、长难度搜索轨迹能把纯 SFT 搜索智能体推到什么程度;agent memory circuit 论文把写入、管理、读取三个记忆阶段拆开做电路追踪;Learning-to-Theorize 则把 world model 从“预测下一帧”推向“从观察中归纳可执行理论”。

Training Signals, Memory Circuits, and Theories of the World

17 minute read

Published:

TL;DR: this round is about structure that is learned before an agent produces the final answer. OpenSeeker-v2 asks how much frontier search-agent behavior can come from carefully filtered SFT trajectories. The agent-memory circuit paper opens the write-manage-read loop and shows that routing, extraction, and grounding emerge at different model scales. Learning-to-Theorize pushes world models away from pure prediction and toward executable, compositional theories inferred from raw observations.

先看清证据,再让模型回答

6 minute read

Published:

TL;DR:本期没有退回 4 月 30 日,而是保留 5 月 3-4 日的新论文,主题是“模型在回答或生成之前应该先看什么”。FlexSQL 让 data agent 在推理过程中反复检查 schema、取值、执行结果和计划,而不是一次性把 schema retrieval 固定下来。Chart-FR1 把密集图表推理训练成显式视觉聚焦过程,让 reasoning step 绑定 OCR 文本和局部区域。PV-VAE 则把 video VAE 从纯重建改成预测式重建,迫使 latent 携带运动和未来变化信息。

Agents That Look Before They Answer

18 minute read

Published:

TL;DR: this round stayed inside the fresh May 3-4 window and picked three papers about giving models a better inspection step before they answer or generate. FlexSQL lets a data agent revisit schemas, values, execution results, and plans instead of freezing retrieval up front. Chart-FR1 trains chart reasoning around explicit visual focus, so dense charts are not treated as one undifferentiated image. PV-VAE changes video VAE training from pure reconstruction to predictive reconstruction, pushing latents to carry motion-relevant structure rather than only pixel detail.

把证据从提示词里移出来

6 minute read

Published:

TL;DR:5 月 1 日到 3 日目标主题里的 arXiv 新稿很少,所以本期在去重后扩展到 4 月 30 日最新窗口。我选了三篇都在把证据移出 prompt 的论文:Claw-Eval-Live 用动态需求信号刷新 workflow-agent 评测,并按可观察动作打分;ObjectGraph 把文档变成可遍历的有类型图,而不是整段注入上下文;CIRM 在推理时干预 reward model 激活,降低格式捷径变成训练标签的风险。

Evidence Surfaces for Agents Outside the Prompt

16 minute read

Published:

TL;DR: the newest May 1-3 arXiv window was thin for the tracked topics, so I expanded to the freshest April 30 papers after deduplicating the existing Paper Radar list. I picked three papers that all move evidence out of the prompt: Claw-Eval-Live refreshes workflow-agent evaluation from live demand signals and grades observable action, ObjectGraph treats documents as traversable typed graphs rather than injected strings, and CIRM edits reward-model activations at inference time so style shortcuts are less likely to become training labels.

在漂移变成答案之前加检查

5 minute read

Published:

TL;DR:这一期我关注的是系统在“看起来已经完成”之前,能不能先发现或修正漂移。5 月 1 日到 3 日目标方向里没有足够新的 arXiv CS 投稿,所以我扩展到 4 月 30 日最新窗口,选了四篇开放全文:PRISM 讨论多模态模型在 RLVR 前的预对齐,PhyCo 讨论视频生成里的物理属性控制,FCMBench-Video 讨论文档证据随时间展开时的评测,Latent Adversarial Detection 讨论多轮攻击意图在激活轨迹中的信号。

Checks Before Agents Drift

20 minute read

Published:

TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

让智能体有可操作的结构化表面

5 minute read

Published:

TL;DR:这一期我关注的不是“再给智能体更多上下文”或“每一步都换成更强模型”,而是智能体工作时需要什么样的结构化操作表面。入选的四篇分别讨论:用共享工作板做大规模 web-to-table 数据抽取、用事件检测决定电脑使用智能体何时升级到大模型、用 trace 分析多智能体系统中的信息污染,以及把 reward model 的机理分析目标从词表投影换成 reward head 方向。

Operating Surfaces for Agents That Need to Stay Correct

16 minute read

Published:

TL;DR: this round is about agents needing structured operating surfaces, not just longer context or more calls to a stronger model. I picked four papers after excluding the recent seen list: a web-to-table multi-agent system with a shared workboard, a step-level cascade for computer-use agents, a trace-level study of information contamination, and a reward-model interpretability library. The first three sit in the newest April 29-30 window; I expanded one slot to April 28 for the mechanism paper because the freshest sparse-autoencoder item had already been covered.

给科学研究智能体更好的结构化接口

4 minute read

Published:

TL;DR:这期我看的不是“智能体又多会做题了”,而是科学研究智能体到底需要什么接口。四篇论文分别把方法演化图谱、领域 foundation model 调用、复杂文献发现、科学可视化 workflow 做成了更明确的结构。我的判断是:研究智能体真正缺的往往不是更长上下文,而是能被查询、调用、验证和回放的中间层。

Structured Interfaces for Scientific Agents

18 minute read

Published:

TL;DR: this round is about scientific agents needing better interfaces to knowledge, tools, and intermediate workflow state. I picked four papers that make that interface explicit: a method-evolution graph for AI scientists, a framework that lets language agents call domain foundation models, a benchmark for hard literature discovery, and a controlled comparison of scientific-visualization agent interaction styles.

让数据与界面智能体面对硬证据

3 minute read

Published:

TL;DR:这期我看的是“证据对象变硬”这件事。四篇论文都不满足于让智能体写一个像样的答案,而是继续追问:科学数据任务能不能被可执行 verifier 检查,表格问答能不能识别隐含预测意图,GUI 智能体能不能到达精确状态,LLM 生成的 reward 能不能在合适的训练阶段再部署。

Hard Evidence for Data and Workflow Agents

16 minute read

Published:

TL;DR: this round is about evaluation objects getting harder. The four papers I chose are not mainly asking whether an agent writes a plausible answer. They ask whether it can pass an executable scientific checker, infer a hidden predictive intent from a table, hit an exact GUI state, or deploy a reward only when the learner is competent enough for that reward to mean something.

为什么智能体需要显式中间层

3 minute read

Published:

TL;DR:这期我盯着同一件事。真正有意思的工作,不是把端到端系统再堆大一点,而是把中间层做显式:BEV token、latent CoT、概念流形、检索页。模型一旦有了可检查的中间表示,几何、控制和证据就都更容易对齐。

Why Agents Need an Explicit Middle Layer

14 minute read

Published:

TL;DR: this round keeps circling one idea. The useful papers are not just bigger end-to-end systems, they are systems that make the middle layer explicit: BEV tokens, latent reasoning, manifold groups, retrieval pages. That is where the model gets a handle on geometry, control, or evidence.

Giving Long-Horizon Agents a Replayable Workspace

4 minute read

Published:

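TL;DR: this round looks at a lower-level question: can the working process of a long-horizon agent be reproduced, restored, and audited? I picked four new papers from April 30 because each makes explicit a layer of long-horizon agents that is easy to overlook: Synthetic Computers constructs user-level workspaces, Crab saves the OS state of sandboxes, Exploration Hacking tests whether a model can resist reinforcement learning by controlling its exploration, and COHERENCE turns interleaved image-text document understanding into a verifiable alignment task.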

Replayable Workspaces for Long-Horizon Agents

19 minute read

Published:

TL;DR: This round is about agents whose work can be replayed, repaired, and audited. I selected four April 30 papers because they each make one hidden layer of long-horizon agency explicit: Synthetic Computers builds user-specific workspaces for productivity simulation, Crab checkpoints the OS state of agent sandboxes, Exploration Hacking tests whether RL can be resisted by strategic under-exploration, and COHERENCE turns interleaved document understanding into a verifiable alignment task.

From Closed-Loop Feedback to Auditable Workflows

3 minute read

Published:

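Short TL;DR: this round I shift attention from “can an agent get through a benchmark” to a harder question: once training and the workflow are over, can the process still be verified? The four papers offer four entry points: FutureWorld uses the outcomes of real-world events, once they resolve, as delayed rewards; AgentSim generates verifiable RAG agent trajectories; DV-World places data-visualization agents in native software, cross-framework evolution, and interactive clarification workflows; and MoRFI looks at the mechanistic level, asking why introducing new facts in post-training can damage a model's access to old knowledge.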

Closed Loops for Auditable Agents

17 minute read

Published:

TL;DR: This round moves from “can an agent solve a benchmark?” toward a harder question: can the loop be audited after the fact? I selected four recent open papers because they put feedback and evidence into the system boundary: FutureWorld waits for real outcomes before assigning rewards, AgentSim records verifiable RAG traces, DV-World tests data-visualization agents in native and interactive workflows, and MoRFI probes how post-training on new facts leaves sparse mechanistic traces.

Giving Long-Horizon Agents Verifiable State

4 minute read

Published:

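Short TL;DR: this round is about a very practical change: stronger agents are not just longer-context models but systems that generate tasks, verify intermediate state, preserve evidence, and learn from changes in world state. I picked four recent open full-text papers: ClawGym builds executable, verifiable computer-use tasks; World2VLM turns world-model imagination into a training signal; DataPRM brings a data-analysis reward model into the environment to check steps; and OCR-Memory stores long-horizon history as optically retrievable evidence.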

Verifiable State for Long-Horizon Agents

16 minute read

Published:

TL;DR: This round is about a practical shift in agent research: stronger agents are not only longer-context models, but systems that synthesize tasks, verify intermediate state, preserve evidence, and learn from world-state transitions. I selected four recent open papers because each one makes state more inspectable: ClawGym builds verifiable computer-use tasks, World2VLM turns world-model imagination into training data, DataPRM verifies data-analysis steps inside the environment, and OCR-Memory stores long agent histories as optically retrievable evidence.