Agentic AI digest - 2026-06-05

A ranked brief from the day's arXiv listing. Cortiq ranks each paper by weighing topical fit, lead-author context, and public research signals before the issue is published.

Agentic AI

1. CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

2. Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

3. Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

4. Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

5. EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

6. Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

7. From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

8. Harnessing Generalist Agents for Contextualized Time Series

9. Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

10. AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

11. Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

12. From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

13. When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

14. Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

15. ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

16. Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

17. SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation

18. The Self-Correction Illusion: LLMs Correct Others but Not Themselves

19. Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

20. Unsupervised Skill Discovery for Agentic Data Analysis

21. CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

22. Agent-Orchestrated Adaptive RAG: A Comparative Study on Structured and Multi-Hop Retrieval

23. Benchmark Everything Everywhere All at Once

24. Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

25. AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

26. What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems

27. Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

28. Insurance of Agentic AI

29. ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL

30. Latent Reasoning with Normalizing Flows

31. Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

32. DexFuture: Hierarchical Future-State Visuomotor Targeting for Bimanual Dexterous Tool Use

33. AdaMEM: Test-Time Adaptive Memory for Language Agents

34. MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

35. When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

36. PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

37. MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

38. Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)

39. Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

40. StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

41. Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies

42. How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

43. Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

44. When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

45. ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

46. LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

47. Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

48. RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing

49. Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

50. LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

51. PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

52. SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization

53. PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation

54. TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

55. Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns

56. State commitment learning: training language models to distinguish computation from memory

57. Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

58. AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

59. RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

60. LoRi: Low-Rank Distillation for Implicit Reasoning

61. QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation

62. TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

63. Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

64. Emergent Language as an Approach to Conscious AI

65. ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense

66. GenTI: Benchmarking LLMs for Autonomous IDPS Rule Generation for Unseen Attacks

67. WebMCP Tool Surface Poisoning: Runtime Manipulation Attacks on LLM Agents

68. Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

69. VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

70. Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

71. ANCHOR: Agentic Noise Creation Framework for Human Simulation and Denoising Recommendation

72. OneReason Technical Report

73. Dynamic Multi-Agent Pickup and Delivery in Robotic Cellular Warehousing Systems

74. Learning of Robot Safety Policies via Adversarial Synthetic Scenarios

75. RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation

76. Flow-based Policy Adaptation without Policy Updates

77. HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

78. SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

79. Agentic Molecular Recovery via Molecule-Aware Exploration