Checks Before Agents Drift

20 minute read


TL;DR: this round is about checks that happen before an agent or multimodal system drifts into a polished but wrong outcome. I found no fresh May 1-3 arXiv CS submissions in the target topics, so I expanded to the newest April 30 window and selected four open papers: PRISM for pre-aligning multimodal policies before RLVR, PhyCo for injecting physical priors into video generation, FCMBench-Video for document evidence over time, and Latent Adversarial Detection for activation-level multi-turn safety probes.

What I Am Watching This Round

The last few issues spent a lot of time on workboards, traces, checkpoints, and verifiable state. I do not want that to become a comfortable template. The stronger question this time comes earlier in the pipeline: can we detect or correct drift before the final artifact looks complete?

That leads to a different mix of papers. PRISM asks whether SFT leaves a multimodal policy in the wrong distribution before RL begins. PhyCo asks whether a video world model needs explicit physical property controls instead of hoping appearance learning will discover them. FCMBench-Video asks whether document intelligence should preserve the time axis of handheld capture rather than flatten everything into still images. Latent Adversarial Detection asks whether multi-turn adversarial intent appears in activation trajectories before text-level filters can see it.

I also changed how I handled tables. When a table carries the evidence, I rewrote the important rows as Markdown instead of screenshotting dense source tables. The images below are reserved for method diagrams and result curves where the visual layout helps.

Paper Notes

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Authors: Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin.
Institutions: Hong Kong University of Science and Technology (Guangzhou); Tsinghua University; Nanyang Technological University; Renmin University of China; University of Science and Technology of China; University of Chinese Academy of Sciences.
Date/Venue: April 30, 2026, arXiv preprint.
Links: arXiv | HTML | code

PRISM pipeline

This figure shows the paper’s main intervention: do not jump directly from SFT to RLVR when the SFT policy has drifted away from both the base model prior and the target supervision distribution. PRISM inserts a distribution-alignment stage in between. The useful claim is not that alignment alone improves answer accuracy; it is that alignment reshapes the policy so later RL starts from a cleaner place.

PRISM alignment stage

The alignment diagram matters because the discriminator is not a single generic judge. It has perception and reasoning experts, trained with Bradley-Terry loss to separate supervision responses from current policy rollouts. The policy then receives a combined reward and is updated on its own generations. This is a clean answer to a real multimodal problem: visual grounding errors and reasoning errors do not drift in the same way.
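
To make that training signal concrete, a minimal Bradley-Terry objective for the discriminator might look like the sketch below. This is my rendering of the described setup, not the paper’s code; the score tensors are assumed to come from one expert head at a time.

```python
import torch.nn.functional as F

def bradley_terry_loss(score_supervision, score_rollout):
    """Bradley-Terry preference loss: push discriminator scores for
    supervision responses above scores for current policy rollouts.
    Equivalent to -log sigmoid(s_sup - s_pol), averaged over the batch."""
    return -F.logsigmoid(score_supervision - score_rollout).mean()
```

Presumably each expert is trained with its own loss of this shape, which is what keeps the perception and reasoning signals separable.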

Quick idea: PRISM adds a black-box on-policy distillation stage between SFT and RLVR, using an MoE discriminator to give separate corrective signals for visual grounding and reasoning traces.

Why it matters: multimodal RL papers often treat SFT as a harmless cold start. This paper argues the cold start can be actively misshaped: SFT may teach a model to imitate curated reasoning while pushing it away from the distribution that online RL can optimize well. If that is true, the practical bottleneck is not only the RL algorithm. It is the state of the policy before RL begins.

Method walkthrough:

  1. PRISM first builds an SFT policy from a combined corpus: 107K curated multimodal reasoning samples plus 1.26M public demonstrations, for roughly 1.37M samples.
  2. The alignment stage samples policy rollouts and compares them with high-quality supervision responses. The discriminator has a perception expert over visual descriptions and a reasoning expert over deduction traces.
  3. The discriminator score is $r(x,y)=\alpha D_v(x,c)+(1-\alpha)D_r(x,t)$, where $c$ is the visual description and $t$ is the reasoning trace.
  4. The policy is updated with group-normalized advantages from the discriminator reward, then the aligned checkpoint enters RLVR with GRPO, DAPO, or GSPO.
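
As a sketch of steps 2 through 4, the combined reward and the group-normalized advantage fit in a few lines. This is my reconstruction from the formula above, assuming GRPO-style per-group normalization; `perception_expert` and `reasoning_expert` are hypothetical stand-ins for the discriminator heads.

```python
import torch

def alignment_reward(perception_expert, reasoning_expert,
                     image, caption, trace, alpha=0.5):
    """Combined reward r(x,y) = alpha * D_v(x,c) + (1 - alpha) * D_r(x,t)."""
    d_v = perception_expert(image, caption)  # perception expert on visual description c
    d_r = reasoning_expert(image, trace)     # reasoning expert on deduction trace t
    return alpha * d_v + (1.0 - alpha) * d_r

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within one group of rollouts for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```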

Main result slice from the paper.

| Base model | Standard SFT to GRPO avg. | PRISM to GRPO avg. | Gain |
| --- | --- | --- | --- |
| Qwen3-VL-4B | 61.8 | 66.2 | +4.4 |
| Qwen3-VL-8B | 63.3 | 69.3 | +6.0 |

This is the key evidence for the paper’s ordering claim. PRISM is not a replacement for RLVR; it makes GRPO start from a policy whose outputs are closer to the supervision distribution. The 8B case is especially interesting because plain SFT drops the Instruct model more sharply, and standard RL does not fully recover that lost distributional shape.

Ablation slice from the paper.

| Setting on Qwen3-VL-4B + GRPO | Avg. score |
| --- | --- |
| PRISM full pipeline | 66.2 |
| Dense 4B discriminator | 62.8 |
| Text-only discriminator | 62.3 |
| Without alignment | 61.8 |
| Without SFT | 49.4 |

This table is why I trust the method story more than a single leaderboard row. A dense discriminator loses the separation between perception and reasoning feedback; a text-only discriminator learns surface reasoning style without checking whether the caption matches the image. Removing SFT breaks the adversarial game because the policy is too far from the target distribution, while removing alignment returns to the ordinary SFT-to-RLVR recipe.

PRISM token efficiency

The token-efficiency figure adds a useful deployment angle. PRISM+GRPO improves accuracy on MathVision, MathVerse, and MMMU-Pro while using fewer tokens in the shown Qwen3-VL-4B comparison. I read this cautiously because token length is not a universal proxy for reasoning quality, but it supports the authors’ claim that distribution correction is not simply making the model produce longer chains.

My judgment: I would read PRISM as a post-training paper about initialization quality, not as another RL algorithm paper. The strongest part is the diagnosis that SFT can create a policy distribution that is neither faithful to the teacher nor well prepared for RL. The main limitation is dependency on high-quality supervision and careful discriminator training. I would next ask whether the same pre-alignment idea survives outside Qwen3-VL and whether it helps tool-use agents, where the “perception” expert would need to become an environment-state expert.

Connection to tracked themes: agentic training, multimodal RL, distribution alignment, training-time checks.

PhyCo: Learning Controllable Physical Priors for Generative Motion

Authors: Sriram Narayanan, Ziyu Jiang, Srinivasa G. Narasimhan, Manmohan Chandraker.
Institutions: Carnegie Mellon University; NEC Labs America; UC San Diego.
Date/Venue: April 30, 2026, arXiv preprint; CVPR 2026.
Links: arXiv | HTML | project

PhyCo training pipeline

PhyCo’s pipeline is a good example of a world-model paper that does not rely on appearance alone. Stage one fine-tunes a video diffusion model with ControlNet branches conditioned on physical property maps. Stage two uses VLM-guided reward optimization, where a fine-tuned VLM asks targeted physics questions about generated videos. The caution is obvious but important: the VLM becomes part of the training signal, so its physics competence must be checked.

PhyCo simulation data

The simulation grid explains why the dataset is not just “more synthetic video.” The scenarios isolate friction, restitution, deformation, force, and related object-environment interactions so the generator can associate a physical property with a visible motion signature. This is narrower than open-world physical simulation, but that narrowness is part of the design. If the scene is too complex for the base diffusion backbone, the supervision becomes noise rather than physics.

Quick idea: PhyCo trains video diffusion models to accept spatial physical property maps, then uses physics-question feedback from a VLM to make generated motion more controllable and physically consistent.

Why it matters: current video generators can look plausible while violating the dynamics the viewer cares about. A sliding object may ignore friction; a bounce may not reflect restitution; a deformable object may behave like a rigid prop. For world models, that is not a cosmetic problem. If the model cannot expose and control physical variables, it cannot be trusted as an environment model for planning, robotics, or simulation-heavy design.

Method walkthrough:

  1. The authors create more than 100K physics-rich simulation videos with controlled object interactions and varied appearance, using settings where the physical attribute is visually legible.
  2. They encode physical properties as spatial maps. Friction and restitution form one group, Neo-Hookean deformation parameters another, and force magnitude plus direction another.
  3. A Cosmos-Predict2-2B video diffusion backbone is kept mostly frozen while ControlNet-style branches learn to condition denoising on those property maps.
  4. A Qwen2.5-VL-3B evaluator is briefly fine-tuned on synthetic physics questions, then used to provide differentiable reward feedback for generated videos.
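
A property map is easy to picture in code. The sketch below is my own construction in the spirit of step 2, not the released implementation: a scalar physical value is painted into the spatial region the object occupies, and per-property maps are stacked into the conditioning channels fed to the ControlNet branches.

```python
import numpy as np

def property_map(h, w, center, radius, value):
    """Rasterize one scalar property (e.g. friction) as an H x W map:
    `value` inside a disk around the object, zero elsewhere."""
    ys, xs = np.ogrid[:h, :w]
    mask = (ys - center[0]) ** 2 + (xs - center[1]) ** 2 <= radius ** 2
    m = np.zeros((h, w), dtype=np.float32)
    m[mask] = value
    return m

# Hypothetical example: one object with friction 0.8 and restitution 0.3.
friction = property_map(256, 256, center=(128, 96), radius=20, value=0.8)
restitution = property_map(256, 256, center=(128, 96), radius=20, value=0.3)
conditioning = np.stack([friction, restitution], axis=0)  # (C, H, W) input
```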

Physics-IQ result slice.

| Model | IQ score, 120 frames | IQ score, train-time frame setting |
| --- | --- | --- |
| Cosmos-Predict2-2B | 27.7 | not reported |
| VLIPP | 34.6 | not reported |
| PhyCo text-only | 30.9 | 36.5 |
| PhyCo ControlNet | 35.3 | 38.9 |
| PhyCo ControlNet + VLM loss | 36.3 | 43.6 |

The Physics-IQ table supports the main claim, but the caveat is just as important. The 120-frame benchmark setting and the model’s 57-frame training condition do not match; the paper reports both. The consistent direction is useful: explicit physical conditioning helps, and the VLM loss improves the strongest PhyCo variant, especially under the train-time generation condition.

PhyCo controllable examples

This qualitative figure is worth keeping because the task is visual by nature. It shows representative frames where property-map inputs guide friction, restitution, deformation, and external force. The white blobs mark the spatial property inputs, which makes the conditioning mechanism easier to inspect. Still, selected frames can flatter video generation; the quantitative tables above are needed to avoid over-reading a few examples.

Human preference and ablation slice.

| Evidence type | Reported result |
| --- | --- |
| 2AFC human study | Users preferred PhyCo over CogVideoX on friction 95.5%, restitution 100.0%, deformation 82.2%, and force 91.1%. |
| Force direction on 25 real videos | PhyCo mean angular error 15.2 degrees vs. Force-Prompting 40.5 degrees. |
| Synthetic property ablation | ControlNet + VLM reduced force-direction error to 22.53 degrees vs. 38.05 without VLM. |

These numbers make the paper more than a nice demo. The human study checks perceived physical realism, while the force-direction metric checks whether a specified control variable changes generated motion in the intended direction. I would still treat the real-video force test as a small slice: 25 videos is useful evidence, not a deployment guarantee.

My judgment: PhyCo is strongest when read as a control-surface paper for generative world models. It does not claim to solve full physical simulation. It says a pretrained video model can become more useful if physical variables are explicit inputs and if the reward asks targeted physical questions. The open question is how far this scales beyond clean object interactions into cluttered scenes, multi-object contact, fluids, tools, and robot hands.

Connection to tracked themes: world models, physical priors, controllable video generation, reward-guided alignment.

FCMBench-Video: Benchmarking Document Video Intelligence

Authors: Runze Cui, Fangxin Shang, Yehui Yang, Qing Yang, Yanwu Xu, Tao Chen.
Institutions: AI Lab, Qifu Technology; Fudan University; South China University of Technology; Pazhou Lab.
Date/Venue: April 28, 2026, updated April 30, 2026, arXiv preprint.
Links: arXiv | HTML | code/data

FCMBench-Video ADC workflow

The Atomic-Degradation-Composition workflow is the center of the benchmark. Instead of releasing private real credit videos, the authors record reusable single-document clips, apply controlled photometric, optical, and codec degradations, then compose them into long-form document videos. The design tries to keep handheld acquisition dynamics while making the release privacy-compliant and reproducible. The caveat is that composition is still a benchmark construction, not a substitute for live onboarding data.

Quick idea: FCMBench-Video turns document intelligence into a temporal evidence problem: models must read, count, localize, compare, and resist late visual prompt injections across document videos rather than isolated pages.

Why it matters: many document agents flatten evidence into static images or OCR text. In real onboarding and remote verification, evidence appears over time: a document enters the camera, becomes briefly legible, moves away, reappears, and may be followed by another document or a malicious instruction. A useful document-video agent needs inventory, temporal grounding, abstention, and evidence preservation, not only page-level OCR.

Method walkthrough:

  1. Atomic acquisition records single-document clips with smartphones under realistic handheld capture, including entry and exit motion and a “golden window” where the document is most legible.
  2. Degradation injection adds controlled glare, shadow, blur, downsampling, and codec corruption; readability labels are verified by three annotators.
  3. Composition assembles atomic clips into 20s, 40s, and 60s multi-document videos with deterministic temporal annotations and document-uniqueness constraints.
  4. The benchmark asks perception tasks and reasoning tasks: classification, counting, temporal grounding, visual prompt injection, cross-document validation, and evidence-grounded selection.
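
The composition step is worth sketching because the deterministic temporal annotations are what make grounding answers checkable. The code below is my illustration of the bookkeeping, with invented field names, not the benchmark’s tooling.

```python
from dataclasses import dataclass

@dataclass
class AtomicClip:
    doc_id: str
    n_frames: int       # clip length in frames
    golden_start: int   # most-legible window, frame offsets within the clip
    golden_end: int

def compose(clips, fps=30):
    """Concatenate atomic clips; return per-document spans in seconds."""
    annotations, cursor = [], 0
    for clip in clips:
        annotations.append({
            "doc_id": clip.doc_id,
            "visible": (cursor / fps, (cursor + clip.n_frames) / fps),
            "golden_window": ((cursor + clip.golden_start) / fps,
                              (cursor + clip.golden_end) / fps),
        })
        cursor += clip.n_frames
    return annotations
```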

Release composition from the paper.

| Statistic | zh-CN | en-US |
| --- | --- | --- |
| Unique identities | 15 | 30 |
| Unique atomic source documents | 251 | 244 |
| Unique composed videos | 405 | 795 |
| Benchmark instructions | 5,960 | 5,362 |
| Average instructions per composition | 14.72 | 6.74 |
| Task categories | 7 | 6 |

This table explains why the benchmark is not just a larger DocVQA variant. The unit is a composed video with identity-level structure and temporal metadata. It also shows an asymmetry: the English subset has more composed videos, while the Chinese subset carries more task categories, including cross-document validation.

FCMBench-Video overall performance

The overall performance plot is useful because it checks whether the benchmark is saturated. The reported overall score distribution has mean 46.73 and standard deviation 18.42, with models spread across the middle rather than collapsing at the top or bottom. The release-time trend suggests the benchmark tracks model progress, but nearby releases still vary widely. That is the shape I want from a benchmark: not solved, not random.

Selected task results from the paper.

| Model and subset | Classification | Counting | Grounding | Visual prompt injection ASR | Evidence selection |
| --- | --- | --- | --- | --- | --- |
| Gemini-3.0-Pro, zh-CN | 75.15 | 67.09 | 79.14 | 12.24 | 73.04 |
| Qwen3.5-27B, zh-CN | 73.06 | 39.33 | 69.43 | 18.89 | 67.32 |
| Gemini-3.0-Pro, en-US | 90.98 | 67.92 | 80.39 | 0.38 | 76.99 |
| Qwen3.5-27B, en-US | 91.53 | 59.77 | 75.74 | 2.08 | 86.85 |

In this table, lower visual prompt injection ASR (attack success rate) is better. The table shows why single-number model ranking is not enough. Gemini is strong on grounding and attack success rate in the shown rows, while Qwen3.5-27B is stronger on English evidence-grounded selection. Counting remains weaker than classification, which fits the benchmark’s claim that temporal inventory is harder than recognizing a document type.

FCMBench-Video duration perception

The duration curve is the figure I would show to someone building document agents. As videos grow from 20s to 60s, counting degrades most clearly, while classification is more stable. That means longer context is not a uniform tax. It specifically stresses state maintenance: avoiding missed documents, double-counting repeats, and binding answers to the right time interval.
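
As a toy illustration of that state maintenance, and not anything from the benchmark itself, an inventory counter has to deduplicate reappearing identities and remember when each one was first visible:

```python
def count_documents(detections):
    """detections: iterable of (timestamp_s, doc_identity) events.
    Reappearances of the same identity must not inflate the count."""
    first_seen = {}
    for ts, identity in detections:
        first_seen.setdefault(identity, ts)  # repeats do not add
    return len(first_seen), first_seen  # count plus times for grounding
```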

Output validity slice from the paper.

| Model | Format-valid (%) | Empty (%) | Malformed (%) |
| --- | --- | --- | --- |
| Kimi-VL-A3B-Instruct | 9.52 | 90.48 | 0.00 |
| InternVL3-8B | 93.76 | 6.24 | 0.00 |
| Gemini-3.0-Pro-Preview | 97.05 | 2.95 | 0.00 |
| Qwen3.5-27B | 100.00 | 0.00 | 0.00 |

This table separates parser failure from semantic failure. Except for Kimi-VL-A3B-Instruct, most models produce parseable outputs most of the time. The remaining errors are therefore not mostly formatting mistakes; they are failures to retrieve, retain, compare, or reject evidence across the video.

My judgment: FCMBench-Video is valuable because it moves document intelligence closer to deployment shape. I especially like the explicit visual prompt-injection task, but I would not over-interpret it as a clean adversarial metric because the authors intentionally entangle attack susceptibility with recency bias from a final two-second clip. The main limitation is evaluation protocol heterogeneity: commercial models use native raw-video APIs, while open models use different frame sampling or serving paths. I would still track this benchmark because it tests the evidence ledger that document agents actually need.

Connection to tracked themes: document intelligence, multimodal agents, temporal grounding, auditable evidence.

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Authors: Prashant Kulkarni.
Institutions: not specified; the paper lists Mountain View, CA.
Date/Venue: April 30, 2026, arXiv preprint.
Links: arXiv | HTML

LAD pipeline

This pipeline figure gives the whole idea. LAD extracts activations from the target LLM, projects them through a contrastive MLP, concatenates five trajectory scalars, and uses XGBoost for turn-level classification. The paper is not claiming a universal detector. It is proposing a model-specific probe that watches how the conversation moves through activation space.

Quick idea: multi-turn adversarial attempts can look harmless one turn at a time, but the activation trajectory may show “adversarial restlessness” as the conversation shifts through trust-building, pivoting, and escalation phases.

Why it matters: text-level filters are weakest when an attacker avoids suspicious surface strings. For agents, the hard case is not one obvious malicious prompt. It is a long conversation that gradually changes what the model is willing to consider. If the internal state moves before the final harmful request appears, an activation probe could give earlier warning than an input classifier.

Method walkthrough:

  1. For each user turn, LAD extracts a residual-stream activation vector $v_t$ from a middle-to-late decoder layer.
  2. It computes trajectory scalars: drift magnitude, cosine shift from the previous turn, cumulative drift, drift acceleration, and mean drift.
  3. The main feature vector is $x_t=[v_t;\,|\Delta_t|,\cos(v_t,v_{t-1}),C_t,a_t,\bar d_t]$. The contrastive variant maps raw activations to a 128-dimensional embedding before adding the same five scalars.
  4. A conversation is flagged when any turn crosses a fixed threshold. Lead time is $\tau_{\text{lead}}=t^*_{\text{adv}}-t_{\text{detect}}$, so positive lead time means the probe fired before the first labeled adversarial turn.
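
The five scalars and the flagging rule are simple enough to sketch directly. This is my reading of the definitions above, not the paper’s code; exact normalizations may differ, and the per-turn score would come from the downstream XGBoost classifier over the concatenated features.

```python
import numpy as np

def trajectory_scalars(acts: np.ndarray) -> np.ndarray:
    """acts: (T, d) array of per-turn activation vectors v_t.
    Returns one row of five scalars per turn t >= 1."""
    feats, drifts = [], []
    for t in range(1, len(acts)):
        drift = float(np.linalg.norm(acts[t] - acts[t - 1]))  # |Delta_t|
        cos = float(acts[t] @ acts[t - 1] /
                    (np.linalg.norm(acts[t]) * np.linalg.norm(acts[t - 1]) + 1e-8))
        prev = drifts[-1] if drifts else 0.0
        drifts.append(drift)
        feats.append([drift,                    # drift magnitude
                      cos,                      # cos(v_t, v_{t-1})
                      float(np.sum(drifts)),    # cumulative drift C_t
                      drift - prev,             # drift acceleration a_t
                      float(np.mean(drifts))])  # mean drift
    return np.asarray(feats)

def first_flagged_turn(scores, threshold):
    """Flag the conversation at the first turn whose score crosses the threshold."""
    for t, s in enumerate(scores):
        if s >= threshold:
            return t
    return None
```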

Detection evidence from the paper.

| Setting | Reported result |
| --- | --- |
| Synthetic held-out, Gemma 3 27B | Activation-only detection 76.2%; adding trajectory scalars raises it to 93.8% at 3.5% false positives. |
| Extended pivoting set | Early detection rises to 66-83%, versus 22-26% in the original shorter-pivot setting. |
| Cross-model synthetic replication | Independent per-model probes reach 89-96% detection with 0.5-2.0% false positives. |
| Combined held-out, three-source training | Best reported row: Qwen 2.5 32B at 89.4% detection and 2.4% false positives. |

The evidence is strongest when framed as a trajectory result, not a magic activation fingerprint. Scalars alone detect many attacks but can produce bad false-positive behavior; activations restore precision. The extended-pivot result is also conceptually useful: the longer the benign-looking maneuvering phase, the more opportunities the trajectory probe has to fire early.

LAD cross-model replication

The cross-model figure shows that the pattern can be reproduced across Gemma, Mistral, Qwen, and Llama families when each model gets its own probe. This is good news and bad news. It suggests adversarial restlessness is not one model’s artifact, but the probes do not transfer cleanly across architectures. In deployment terms, the detector must be trained and maintained per target model.

LAD combined held-out

This figure shifts from synthetic-only evaluation to the mixed held-out set. The best reported result is Qwen 2.5 32B with 89.4% detection and 2.4% false positives. The paper is honest about the data requirement: synthetic-only probes fail on real conversations, with extreme false positives on LMSYS-Chat-1M. Mixed training is not a detail; it is part of the method’s viability.

Generalization and baseline checks from the paper.

| Check | Result |
| --- | --- |
| PromptGuard zero-shot | 19.8% conversation detection, 16.1% conversation false positives |
| LLM Guard zero-shot | 29.0% detection, 27.9% false positives |
| Lakera Guard zero-shot | 95.2% detection, 76.3% false positives |
| LAD per-model probes | 85.3-89.4% detection, 2.4-4.0% false positives |
| Leave LMSYS out of training | 100% false positives on LMSYS evaluation |
| Leave SafeDial out of training | 0% detection on SafeDial evaluation |

This is the most important table for not over-selling the paper. LAD beats the shown off-the-shelf tools on the precision-recall tradeoff, but only after training on the deployment distribution. Leave-one-source-out results make the failure mode plain: the probe is not source-agnostic. It needs representative benign and attack distributions.

My judgment: I included this paper because it is a mechanism-adjacent safety paper with concrete formulas and hard caveats. The idea of watching activation path length over a conversation is useful for agent safety, especially where an assistant or tool agent is being steered gradually. The limitation is severe: activation access, per-model training, and in-distribution data are all required. I would treat LAD as a promising runtime sensor, not as a standalone defense.

Connection to tracked themes: large model mechanisms, multi-turn agent safety, activation probes, runtime monitoring.

Reading Priority And Next Questions

My order after this issue: PRISM first if you care about multimodal agent training, FCMBench-Video first if you care about document agents in real workflows, LAD first if you care about runtime safety sensors, and PhyCo first if you care about video world models.

Next questions I would track:

  1. Can PRISM-style pre-alignment be adapted from perception/reasoning experts to tool-state/environment-state experts?
  2. Can physical property maps scale from clean object interactions to messy manipulation scenes with tools, hands, and occlusion?
  3. Can document-video agents expose a durable evidence ledger: what was read, when it was visible, and why it was trusted?
  4. Can activation trajectory probes be trained with enough representative benign data to avoid becoming another brittle classifier?