Curated academic papers that inform our research directions.
Note: These are external publications from arXiv, not SparseTech publications. We share them as context for the mathematical foundations underlying our work.
Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
Guangzhao He, Rundong Luo, Wei-Chiu Ma +1 more
Published: June 1, 2026
cs.CV
Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image…
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling
Seojeong Park, Jiho Choi, Junyong Kang +3 more
Published: June 1, 2026
cs.CV, cs.AI
Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and…
RoboDream: Compositional World Models for Scalable Robot Data Synthesis
Junjie Ye, Rong Xue, Basile Van Hoorick +6 more
Published: June 1, 2026
cs.RO, cs.CV
Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling, existing generative approaches are often limited to superficial visual…
ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning
Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang +1 more
Published: June 1, 2026
cs.CV, cs.LG
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration,…
From Zero to Hero: Training-Free Custom Concept Spawning in World Models
Kiymet Akdemir, Pinar Yanardag
Published: June 1, 2026
cs.CV
Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the…
HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
Hezhen Hu, Wangbo Zhao, Lanqing Guo +6 more
Published: June 1, 2026
cs.CV
In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation…
VISReg: Variance-Invariance-Sketching Regularization for JEPA training
Haiyu Wu, Randall Balestriero, Morgan Levine
Published: June 1, 2026
cs.CV
Self-supervised learning methods prevent embedding collapse via modeling heuristics or explicit regularization of the embedding space. Among the latter, VICReg decomposes regularization into variance and covariance objectives, offering flexibility and interpretability. However, covariance captures only second-order…
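As a reading aid for the variance and covariance objectives mentioned above, here is a minimal pure-Python sketch of the two VICReg-style terms. It illustrates the baseline regularizer only, not the sketching approach this paper proposes; the function name `vicreg_terms` and the default values of `gamma` and `eps` are assumptions chosen for illustration.

```python
import math


def vicreg_terms(z, gamma=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of embeddings.

    z is a list of n embeddings, each a list of d floats (n >= 2).
    gamma and eps are assumed illustrative defaults.
    """
    n, d = len(z), len(z[0])
    # Center each embedding dimension over the batch.
    mean = [sum(row[k] for row in z) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in z]
    # Unbiased (n - 1) covariance matrix between embedding dimensions.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Variance term: hinge that penalizes dimensions whose std is below gamma.
    variance_term = sum(
        max(0.0, gamma - math.sqrt(cov[k][k] + eps)) for k in range(d)
    ) / d
    # Covariance term: squared off-diagonal entries, a second-order statistic.
    covariance_term = sum(
        cov[a][b] ** 2 for a in range(d) for b in range(d) if a != b
    ) / d
    return variance_term, covariance_term


collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
correlated = [[-1.0, -1.0], [1.0, 1.0]]
print(vicreg_terms(collapsed))   # high variance penalty, zero covariance penalty
print(vicreg_terms(correlated))  # zero variance penalty, nonzero covariance penalty
```

A collapsed batch is penalized by the variance term, while a batch whose dimensions move together is penalized by the covariance term.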
AdaCodec: A Predictive Visual Code for Video MLLMs
Haowen Hou, Zhen Huang, Zheming Liang +8 more
Published: June 1, 2026
cs.CV, cs.AI, cs.CL
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a…
ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents
Yuxing Lu, Yushuhong Lin, Wenqi Shi +4 more
Published: June 1, 2026
cs.AI, cs.CL, cs.ET
Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe and existing interactive medical benchmarks each compromise on at least one of…
Strong Polarization and Entropy
Daniel Galicer, Oscar Ortega-Moreno, Damián Pinasco
Published: June 1, 2026
math.FA, cs.IT
We show that for any set of $n$ unit vectors $v_1,\ldots,v_n$ in a real Hilbert space and positive numbers $p_1,\ldots,p_n$ satisfying $\sum_j p_j = 1$, there exists a unit vector $u$ such that
\[
\sum_{j=1}^n \frac{p_j^2}{\langle v_j, u\rangle^2}\leq 1.
\]
This inequality is a weighted version of the strong…
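A worked special case may help in reading the inequality (this example is our own illustration, not taken from the abstract). Assume the $v_j$ are orthonormal and take $u = \sum_{j=1}^n \sqrt{p_j}\, v_j$. Then
\[
\|u\|^2 = \sum_{j=1}^n p_j = 1, \qquad \langle v_j, u\rangle = \sqrt{p_j}, \qquad \sum_{j=1}^n \frac{p_j^2}{\langle v_j, u\rangle^2} = \sum_{j=1}^n p_j = 1,
\]
so for orthonormal vectors this choice of $u$ satisfies the bound with equality.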
Policy-based Foveated Imaging and Perception
Howard Xiao, Jan Ackermann, Boyang Deng +1 more
Published: June 1, 2026
cs.CV
Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches address this challenge through…
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
Junhao Cheng, Liang Hou, Tianxiong Zhong +4 more
Published: June 1, 2026
cs.CV
The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across…
IntraShuffler: A Privacy Preserving Framework for Heterogeneous DP Federated Learning
Farhin Farhad Riya, Olivera Kotevska, Jinyuan Stella Sun
Published: June 1, 2026
cs.LG, cs.CR, cs.DC
Heterogeneous Differential Privacy (HDP) in Federated Learning (FL) allows clients to select individual privacy budgets ($\varepsilon_i$) according to institutional policies and data sensitivity. In practice, many HDP-FL systems employ $\varepsilon$-aware server aggregation to improve model utility by re-weighting…
Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics
Haimin Hu
Published: June 1, 2026
cs.RO, cs.AI, cs.LG
Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferences, goals, competency, and willingness to cooperate. Safety filters are a popular approach for ensuring safety in interactive robotics, since their modular design separates safety…
From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
Elia Cunegatti, Marcus Vukojevic, Erik Nielsen +1 more
Published: June 1, 2026
cs.CL, cs.AI
Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in…
HERO'S JOURNEY: Testing Complex Rule Induction with Text Games
Anshun Asher Zheng, Kanishka Misra, David I. Beaver +1 more
Published: June 1, 2026
cs.CL
We introduce HERO'S JOURNEY, a benchmark for rule induction in goal-directed episodic tasks, where agents must infer hidden rules from demonstrations and act on them through multi-step execution. HERO'S JOURNEY covers eight tasks across attribute and procedural induction families, each with four structural rule forms,…
LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
Qixin Hu, Shuai Yang, Wei Huang +2 more
Published: June 1, 2026
cs.CV
Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active…
Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation
Siyuan Bian, Congrong Xu, Jun Gao
Published: June 1, 2026
cs.CV, cs.AI
Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this artifact to a standard modeling choice: assigning each pixel a single depth…
AFUN: Towards an Affordance Foundation Model for Functionality Understanding
Zhaoning Wang, Yi Zhong, Jiawei Fu +2 more
Published: June 1, 2026
cs.RO, cs.CV
Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes…
SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation
Priyaranjan Pattnayak
Published: June 1, 2026
cs.CL
Word Error Rate (WER) is the dominant metric for automatic speech recognition (ASR), but it can overestimate errors when references and hypotheses encode the same words in different scripts. This issue is common in multilingual settings where ASR models may emit romanized text. We propose Script-Normalized WER…
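To make the script-mismatch issue concrete, here is a minimal sketch of standard WER as word-level edit distance divided by reference length. It shows only the plain metric, not the script normalization the paper proposes; the function name `word_error_rate` and the example sentences are our own illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain WER: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Same words, same script: no errors.
print(word_error_rate("नमस्ते दुनिया", "नमस्ते दुनिया"))
# Same words, romanized hypothesis: every word counts as a substitution.
print(word_error_rate("नमस्ते दुनिया", "namaste duniya"))
```

Because the comparison is exact string matching per word, a romanized hypothesis is scored as fully wrong even when it encodes the same words as the reference.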
Transferable Self-Harm Surveillance from Emergency Department Triage Notes Using an Evidence-Augmented Machine Learning Approach
Liuliu Chen, Gowri Rajaram, Eleanor Bailey +5 more
Published: June 1, 2026
cs.CL
Self-harm is a major public health concern, but current surveillance relying on hospital presentations is inadequate due to the low sensitivity of diagnostic codes. Emergency Department (ED) triage notes, recorded at the initial point of contact, provide a succinct summary of presentations and an opportunity to…
SimSD: Simple Speculative Decoding in Diffusion Language Models
Junxia Cui, Haotian Ye, Runchu Tian +9 more
Published: June 1, 2026
cs.CL, cs.AI
Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most…
SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction
Yuting Ning, Zhehao Zhang, Yash Kumar Lal +8 more
Published: June 1, 2026
cs.CL
Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills…
Tracking the Behavioral Trajectories of Adapting Agents
Jonah Leshin, Manish Shah, Ian Timmis
Published: June 1, 2026
cs.AI
Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Through edits by humans or the agents themselves, these files may evolve over time, directly steering the agent's behavior in future interactions. We present a methodology and…