Workshop Program – INFOCOM 2026 Workshop on Embodied Intelligence Networks (EIN)

2:00 p.m. - 2:05 p.m.

Opening

Welcome and Workshop Introduction

Baochun Li (General Chair), Edith C. H. Ngai, Ningxin Su, and Hao Wang (Program Co-Chairs)

2:05 p.m. - 3:00 p.m.

Keynote

Keynote talk

Speaker:György Dán (KTH Royal Institute of Technology, Stockholm, Sweden)

Title:From Pixels to Policies: Securing the Perception–Action Loop in Networked Embodied Intelligence

Abstract:As networked embodied systems increasingly rely on machine learning for perception and decision-making, vulnerabilities to adversarial manipulation pose a fundamental challenge to their safe and reliable operation. In such systems, intelligence emerges from tightly coupled perception–action loops, where attacks at the sensory level can propagate through policies and disrupt coordinated behavior at the system level. This talk examines how to secure this loop—from pixels to policies—by developing mechanisms that go beyond static robustness. At the level of individual agents, we present recent advances in making core computer vision tasks, including object detection and classification, resilient to physically realizable adversarial attacks. These approaches combine ensemble saliency analysis with reconstruction-based defenses to enable real-time detection while maintaining strong performance under attack. At the level of networked multi-agent systems, we study how agents can achieve situational awareness by identifying compromised or anomalous peers in a fully decentralized manner, framing the problem as a sequential hypothesis testing problem. We finally discuss the problem of achieving robustness to adversarial behavior in a multi-agent setting in a Bayesian framework. Together, these results point to a broader principle for embodied intelligence: ensuring resilience in real-world systems requires closing the loop with capabilities for detection, adaptation, and coordinated response to adversarial inputs across both perception and policy layers.

Biography: György Dán is professor of teletraffic systems at KTH Royal Institute of Technology, Stockholm, Sweden. He received the M.Sc. in computer engineering from the Budapest University of Technology and Economics, Hungary in 1999, the M.Sc. in business administration from the Corvinus University of Budapest, Hungary in 2003, and the PhD in Telecommunications from KTH in 2006. He worked as a consultant in the field of access networks, streaming media and videoconferencing 1999-2001. He was a visiting researcher at the Swedish Institute of Computer Science in 2008, a Fulbright research scholar at University of Illinois at Urbana-Champaign in 2012-2013, and an invited professor at EPFL in 2014-2015. He served as area editor of Computer Communications 2014-2021, as editor of IEEE Transactions on Mobile Computing 2019-2023, serves on the TPC of conferences like IEEE Infocom and ACM e-Energy, and is chair of the steering committee of IEEE SmartGridComm. He has received several best paper awards from IFIP and IEEE in recent years. His research interests include the design and analysis of mobile computing systems, game theoretical models of networked systems, and cyber-physical system security and resilience.

3:00 p.m. - 3:30 p.m.

Session I

Federated Learning for Embodied Intelligence

3:00 p.m. - 3:15 p.m.

Federated Self-Evolving Embodied AI Agents

Leming Shen and Yuanqing Zheng (The Hong Kong Polytechnic University, Hong Kong)

Show abstract

The integration of Large Language Models (LLMs) fosters Embodied AI (EAI) agents to perceive and interact with the physical world through natural language instructions. However, existing EAI agents typically operate within fixed agentic workflows and predefined design spaces, struggling to handle rapidly evolving real-world EAI scenarios (e.g., diverse task pipelines, dynamic environments). Unlike prior systems that treat LLMs mainly as text-generators following rigid workflows, we fully exploit their reasoning capabilities to allow agents to determine their own workflows on the fly. To this end, we propose FSEAI that encourages EAI agents to self-explore and self-evolve via federated collaboration across heterogeneous environments. Inspired by human learning processes, we distill EAI tasks into three atomic operations (observe, reason, act), and empower agents to explore workflows by dynamically and adaptively selecting an appropriate next operation based on the current state. Our evaluations show that, with federated collaboration, our FSEAI agents can achieve up to a 42.6% higher task success rate and a 41.6K token cost reduction than state-of-the-art (SOTA) baselines, while maintaining adaptability to unforeseen EAI scenarios. This highlights the potential of reasoning-driven adaptive agentic workflow towards cognitive EAI.

3:15 p.m. - 3:30 p.m.

pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI

Qingqian Yang and Hao Wang (Stevens Institute of Technology, USA); Sai Qian Zhang (Meta/New York University, USA); Jian Li (Stony Brook University, USA); Yang Hua (Queen's University Belfast, United Kingdom (Great Britain)); Miao Pan (University of Houston, USA); Tao Song, Zhengwei Qi and Haibing Guan (Shanghai Jiao Tong University, China)

Show abstract

Vision-Language Navigation (VLN) aims to enable embodied AI agents to follow natural language instructions and navigate in real indoor environments, but training strong VLN policies typically requires large-scale trajectory-instruction data collected in private spaces, including homes and offices, raising substantial privacy concerns. Federated Learning (FL) provides a natural remedy by keeping data on-device and only exchanging model updates; however, vanilla FL struggles in VLN due to extreme cross-client heterogeneity in environment layouts, navigation graphs, and personalized instruction styles, making a single global model suboptimal for many clients. In this paper, we propose pFedNavi, a structure-aware and dynamically adaptive personalized federated learning framework tailored for VLN. Our key idea is to personalize where it matters: pFedNavi (i) adaptively identifies client-specific layers via layer-wise mixing coefficients, and (ii) performs fine-grained parameter fusion on the selected components (e.g., the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. We evaluate pFedNavi on two standard VLN benchmarks, R2R and RxR, under both ResNet and CLIP visual representations. Across metrics, pFedNavi consistently outperforms the baseline-FedAvg-based VLN, achieving improvement in navigation success (up to 7.5% in success rate) and navigation trajectory quality (up to 7.8% in normalized dynamic time warping), while converging 1.38× faster under non-IID conditions.

3:30 p.m. - 4:00 p.m.

Break

Coffee break

4:00 p.m. - 5:00 p.m.

Session II

Adaptive and Collaborative Embodied Intelligence

4:00 p.m. - 4:15 p.m.

E-RECAP: Embodied REplanning with Cost-Aware Pruning

Shuaijun Liu and Ningxin Su (The Hong Kong University of Science and Technology (Guangzhou), China)

Show abstract

Replanning is a core and frequent process in embodied AI systems, where agents continuously adapt their plans based on partial observations and dynamic environments. With the increasing adoption of large language models and vision-language models as high-level planners, each replanning cycle requires processing long context histories that grow over time and with the number of agents, making replanning inference cost a critical system bottleneck. However, existing work largely treats this cost as unavoidable, lacking system-level optimization methods specifically targeting the replanning stage. We observe that replanning is a high-level reasoning process that can tolerate information approximation, and not all historical tokens are equally important for current replanning decisions. We propose E-RECAP, a cost-aware token pruning method for embodied replanning that dynamically removes low-importance tokens from planner context while preserving replanning-critical information. E-RECAP is a system-level, drop-in optimization module that operates only during replanning, without modifying task definitions, environments, or control policies. Experimental results demonstrate significant acceleration: up to 2.64 times speedup in single-GPU embodied evaluations and up to 39.7 times speedup in long-context synthetic replanning under multi-GPU inference, while maintaining task success rates and replanning behavior. E-RECAP provides a plug-and-play solution for scalable embodied AI systems, particularly beneficial for long-horizon tasks and multi-agent scenarios where context accumulation is inevitable.

4:15 p.m. - 4:30 p.m.

VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response

Shengding Liu and Qiben Yan (Michigan State University, USA)

Show abstract

Indoor fire disasters pose severe challenges to autonomous search and rescue due to dense smoke, high temperatures, and dynamically evolving indoor environments. In such time-critical scenarios, multi-agent cooperative navigation is particularly useful, as it enables faster and broader exploration than single-agent approaches. However, existing multi-agent navigation systems are primarily vision-based and designed for benign indoor settings, leading to significant performance degradation under fire-driven dynamic conditions. In this paper, we present VULCAN, a multi-agent cooperative navigation framework based on multi-modal perception and vision-language models (VLMs), tailored for indoor fire disaster response. We extend the Habitat-Matterport3D benchmark by simulating physically realistic fire scenarios, including smoke diffusion, thermal hazards, and sensor degradation. We evaluate representative multi-agent cooperative navigation baselines under both normal and fire-driven environments. Our results reveal critical failure modes of existing methods in fire scenarios and underscore the necessity of robust perception and hazard-aware planning for reliable multi-agent search and rescue.

4:30 p.m. - 4:45 p.m.

Rethinking IoT for Embodied AI: From Data Infrastructure to Intelligence Infrastructure

Shuai Tong and Jiliang Wang (Tsinghua University, China)

Show abstract

Embodied Artificial Intelligence (Embodied AI) depends on tightly coupled perception-action loops that require continuous sensing, low-latency communication, distributed reasoning, and coordinated actuation, making it inherently dependent on large-scale Internet of Things (IoT) infrastructures. Despite this natural alignment, existing IoT systems were largely designed for data acquisition and connectivity rather than for closed-loop embodied intelligence, leading to fundamental architectural mismatches that constrain the scalability, responsiveness, and robustness of embodied AI in real-world deployments. This paper presents a challenge-oriented technical perspective on the role of IoT in embodied AI. Using a four-layer embodied AI stack (i.e., sensing, cognition, action, and networking) as a unifying analytical framework, we systematically examine how current IoT technologies support, shape, and limit embodied AI systems. Our analysis exposes key tensions spanning semantic sensing, latency-efficiency trade-offs in communication, fragmentation of distributed computation, and weak coordination of networked actuation. Based on these findings, we argue for new abstractions and cross-layer co-design principles to align future IoT infrastructures with the requirements of closed-loop embodied intelligence.

4:45 p.m. - 5:00 p.m.

KV-SC: KV-Based Semantic Collaboration for Distributed Embodied Intelligence Networks

Baoxia Du and Ruidong Li (Kanazawa University, Japan); Yinfeng Cao (The Hong Kong Polytechnic University, Hong Kong); Dusit Niyato (Nanyang Technological University, Singapore)

Show abstract

Edge devices for embodied AI operate under tight computation and energy budgets, motivating edge-cloud collaboration for high-quality closed-loop control. However, conventional text-to-text (T2T) collaboration introduces substantial end-to-end (E2E) latency due to token-by-token decoding, and suffers from output-format drift that undermines dependable actuation. We propose KV-SC, a key-value (KV) cache-based semantic collaboration framework that replaces intermediate text messages with KV-cache transmission. The cloud performs prefill-only inference to produce KV caches as structured semantic states, which are then injected into the edge controller to decode stable, machine-parsable actions. To further reduce communication overhead, we introduce a two-stage KV-cache compression method and show that aggressive compression preserves control quality while significantly reducing downlink payload. Experiments in a multivehicle VMAS road-traffic environment demonstrate that KV-SC improves closed-loop driving performance and consistently reduces E2E latency compared to T2T baselines.

5:00 p.m. - 6:00 p.m.

Session III

Systems, Security, and Optimization for Embodied AI Infrastructure

5:00 p.m. - 5:15 p.m.

Jiao: Bridging Isolation and Customization in Mixed Criticality Robotics

James Yen, ZhiBai Huang and Zhixiang Wei (Shanghai Jiaotong University, China); Tinghao Yi, Shupeng Zeng and Liang Pang (Openmind, China); Songtao Xue (Shanghai Jiaotong University, China); Zhengwei Qi (Shanghai Jiao Tong University, China)

Show abstract

Consumer robotics demands consolidation of safety-critical control, perception pipelines, and user applications on shared multicore platforms. While static partitioning hypervisors provide hardware-enforced isolation, directly transplanting automotive architectures encounters an expertise asymmetry problem in which end-users modifying robot behavior lack the systems knowledge that platform developers possess. We present an architecture addressing this challenge through three integrated components. A Safe IO Cell provides hardware-level override capability. A Parameter Synchronization Service encapsulates cross-domain complexity. A Safety Communication Layer implements IEC 61508-compliant verification. Our empirical evaluation on an ARM Cortex-A55 platform demonstrates that partition isolation reduces cycle-period jitter by 84.5% and cuts tail timing error by nearly an order of magnitude (p99 (|jitter|) from 69.0 (μs) to 7.8 (μs)), eliminating all >50 (μs) excursions.

5:15 p.m. - 5:30 p.m.

MEC-based HFL: A Fair Aggregation Approach for Non-IID Data via Minimum Enclosing Circle

Fan-Hsun Tseng, Jiang-Yi Zeng and Yu-Teng Lai (National Cheng Kung University, Taiwan); Hsin-Hung Cho and Chi-Yuan Chen (National Ilan University, Taiwan)

Show abstract

Resource-constrained entities in an embodied intelligence network can be trained collaboratively by federated learning (FL). However, the data heterogeneity of these physical entities poses a significant difficulty in training convergence, particularly for the non-independent and identically distributed (non-IID) problem. Although hierarchical FL (HFL) uses clustering approaches to group similar data through two-phase model aggregation, it still suffers from the data heterogeneity of two inter-clusters. To solve the problem, this paper presents a fairness aggregation method based on the Minimum Enclosing Circle (MEC) and named it MEC-based HFL. The proposed MEC-based HFL calculates the center of the smallest enclosing circle that encompasses all data points from physical entities at the edge. It provides a fairer aggregation result by being closer to underrepresented clusters and balancing contributions from all edge devices. Extensive experiments demonstrate that the MEC-based HFL offers superior learning accuracy, faster model convergence, and better training efficiency compared to other classic FL algorithms.

5:30 p.m. - 5:45 p.m.

Towards Operating System Automated Optimization via System Knowledge and Tunability Embedding

Yuxin Ren, Donghui Chen, Yun Hao, Nan Zhang, Zhipeng Xie, Yanqiang Liu, Bo Wan, Bo Zhang, Xu Wang, Wanming Hu, Haili Bai, Ning Jia and Xinwei Hu (Huawei Technologies, China)

Show abstract

Performance and cost optimization are always the top demand of OS. Traditional OS architectures struggle with optimization due to growing system complexity and inadequate heuristic methods. By leveraging advancements in generative AI and large models, this positioning paper defines a new OS paradigm for self-optimization, self-adaptation, and self-evolution. Furthermore, achieving this paradigm is viable due to system knowledge and tunability embedding. Our existing case studies and deployment practices demonstrate that integrating accumulated empirical knowledge with generative AI can effectively tune system to realize 3.89% to 53.2% performance improvement for various realistic applications.

5:45 p.m. - 6:00 p.m.

Multimodal Behavioral PUF Authentication for Physical AI Robots Combining Electromechanical Actuation and Computational Load Signatures

Jiho Lee, Jaehyung Jeong, Jayeon Pyo and JaeSeung Song (Sejong University, Korea (South))

Show abstract

Physical AI systems, particularly humanoid and collaborative robots, are increasingly deployed in safety-critical applications. However, existing authentication mechanisms do not check whether the claimed computational identity is actually tied to the specific actuator hardware that executes the motion, leaving a gap between cyber and physical domains. This paper presents a multimodal behavioral Physical Unclonable Function (PUF) architecture that combines electromechanical actuation characteristics with computational load signatures for device authentication. Our system generates challenge-response pairs by requesting robots to perform simultaneous physical motions and cryptographic operations, extracting unique signatures from motor impedance variations and processor power consumption patterns. A synchronized data acquisition module captures temporal correlations between cyber and physical domains at microsecond precision, while a fuzzy extractor based on Bose-Chaudhuri-Hocquenghem (BCH) codes ensures stable key generation despite environmental variations. Our theoretical analysis indicates that the proposed PUF can provide more than 128 bits of effective min-entropy, with an expected inter-device uniqueness of 49.2% and intra-device reliability of 94.1% under nominal conditions. In addition, the requirement to execute coupled motion and cryptographic tasks makes signal injection, pure software emulation, and straightforward replay attacks significantly harder, because an adversary must reproduce both domains and their timing relationship. The proposed method targets embodied AI platforms in which robots can physically manipulate their environment, and aims to provide an authentication mechanism that explicitly accounts for this actuation capability as part of the security boundary.

6:00 p.m. - 6:05 p.m.

Closing

Closing Remarks

Baochun Li (University of Toronto, General Chair)

Embodied Intelligence Networks (EIN)

Program Schedule

Welcome and Workshop Introduction

Keynote talk

Federated Learning for Embodied Intelligence

Federated Self-Evolving Embodied AI Agents

pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI

Coffee break

Adaptive and Collaborative Embodied Intelligence

E-RECAP: Embodied REplanning with Cost-Aware Pruning

VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response

Rethinking IoT for Embodied AI: From Data Infrastructure to Intelligence Infrastructure

KV-SC: KV-Based Semantic Collaboration for Distributed Embodied Intelligence Networks

Systems, Security, and Optimization for Embodied AI Infrastructure

Jiao: Bridging Isolation and Customization in Mixed Criticality Robotics

MEC-based HFL: A Fair Aggregation Approach for Non-IID Data via Minimum Enclosing Circle

Towards Operating System Automated Optimization via System Knowledge and Tunability Embedding

Multimodal Behavioral PUF Authentication for Physical AI Robots Combining Electromechanical Actuation and Computational Load Signatures

Closing Remarks