
About
I am Xiqian Yu (余茜倩), a Research and Development Engineer at the Embodied AI Center, Shanghai AI Laboratory, working with Dr. Tai Wang. My research interests lie in embodied AI, vision-language-action models, and large-scale multimodal learning for embodied agents.
My recent work centers on embodied foundation models, spanning both navigation and manipulation. On the task side, I am particularly interested in streaming vision-language navigation and dual-system cooperation for generalizable agents. On the training and infrastructure side, I focus on co-training with heterogeneous robotic and multimodal data, large-scale data processing pipelines, and distributed training systems. Going forward, I am motivated by how data, training infrastructure, and model architecture can be jointly optimized to improve the scalability and generalization of embodied foundation models.
Selected Publications


InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans

StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

NaVid-4D: Unleashing Spatial Intelligence in Egocentric RGB-D Videos for Vision-and-Language Navigation

GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors
Education
University of Science and Technology of China
Sep. 2022 - Jul. 2025
Master, Electronics and Communication Engineering
Supervisor: Prof. Zhibo Chen
Shandong University
Sep. 2016 - Jul. 2020
Bachelor, Electronic Engineering and Information Science
Internship
Shanghai AI Laboratory
Jan. 2025 - Jun. 2025
Intern, Embodied AI Center
Vision-Language-Action
Galbot
Jan. 2024 - Jul. 2024
Intern, Algorithm Center
Vision-Language Navigation
Skills
Programming & Frameworks
- Languages & Core: Python, C/C++, CUDA
- DL & Systems: PyTorch, Hugging Face (Accelerate, Transformers), DeepSpeed, Slurm Cluster
- Distributed Training: Multi-node multi-GPU training, NCCL optimization
Multimodal & VLA Pre-training
- Multimodal Pre-training: Vision-language pre-training and co-training, multimodal representation alignment
- Data Infrastructure: Large-scale multimodal data processing, heterogeneous robot and multimodal dataset organization, dataset mixture design
- Training Operations: Co-training over heterogeneous robot and multimodal datasets, high-throughput data loading, distributed batch scheduling, precision alignment
VLA Post-training & Alignment
- Action Adaptation: Vision-language-action end-to-end fine-tuning, action expert integration
- Planning & Evaluation: Long-horizon task planning, closed-loop evaluation
- Model Development: VLA post-training, alignment, and training infrastructure for scalable embodied foundation models