SHENZHEN, China, May 28, 2026 /PRNewswire/ — X Square Robot today announced the open-source release of Wall-OSS-0.5, a Vision-Language-Action (VLA) model designed for real-world robotic manipulation. The release aims to answer a key question in embodied AI: can VLA pretraining produce robot capabilities that are directly observable on physical hardware, rather than serving only as initialization for downstream fine-tuning?
To evaluate this, the model was tested in its pretrained form, before any task-specific fine-tuning, on a 17-task real-robot zero-shot suite spanning semantic understanding, rigid-object manipulation, deformable-object manipulation, fine-grained manipulation, and long-horizon tasks. The pretrained checkpoint achieved task-progress scores above 80 on multiple tasks, including Block Sorting (100), Fruit Sorting (96), Ring Stacking (86), and the held-out deformable task Rope Tightening (82). These results indicate that VLA pretraining alone can produce measurable and transferable robot behavior.
To support this capability, Wall-OSS-0.5 is trained in a single stage on a three-source mixture: self-collected manipulation data, curated open-source multi-embodiment trajectories, and a 90M-sample multimodal corpus that includes embodied bridge samples synthesized from action trajectories.
At the core of the training recipe is gradient-bridged co-training, where action supervision directly shapes the VLM backbone during pretraining. Robotic actions are integrated into the core representation learning process rather than treated as a separate downstream module.
In this framework, discrete action-token cross-entropy acts as the gradient bridge, providing a VLM-native training signal that injects action awareness into the backbone. Multimodal cross-entropy anchors grounded vision-language understanding, including instruction following and embodied scene understanding. Flow matching trains the continuous action generator used during execution on physical robots.
These discrete and continuous learning paths play complementary roles. Action tokens help the backbone learn robot behavior at the representation level, while flow matching generates executable continuous actions for control. Ablation studies show that removing key components of this co-training recipe substantially degrades real-robot performance.
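As a rough illustration of how the three training signals described above could be combined into one objective, the following is a minimal sketch and not the released training code. The function names, the equal loss weights, and the straight-line flow-matching target are all assumptions made for illustration.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean token-level cross-entropy. logits: (N, V); targets: (N,) integer ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def flow_matching_loss(pred_velocity, noise, action):
    """MSE against the straight-line velocity target (action - noise).

    The linear noise-to-action path is an assumed convention.
    """
    return float(np.mean((pred_velocity - (action - noise)) ** 2))


def co_training_loss(action_logits, action_tokens, text_logits, text_tokens,
                     pred_velocity, noise, action,
                     w_action=1.0, w_text=1.0, w_flow=1.0):
    """Weighted sum of the three signals. The weights are hypothetical."""
    parts = {
        # Discrete action-token cross-entropy: the "gradient bridge" into the backbone.
        "action_ce": cross_entropy(action_logits, action_tokens),
        # Multimodal cross-entropy: anchors vision-language understanding.
        "text_ce": cross_entropy(text_logits, text_tokens),
        # Flow matching: trains the continuous action generator.
        "flow": flow_matching_loss(pred_velocity, noise, action),
    }
    total = (w_action * parts["action_ce"]
             + w_text * parts["text_ce"]
             + w_flow * parts["flow"])
    return total, parts
```

In a real system the two cross-entropy terms would backpropagate into the shared VLM backbone, which is what lets action supervision shape its representations; this sketch only shows how the scalar losses are formed.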
Across pretraining, performance improves on both seen and unseen tasks from 50k to 400k steps, with average task progress rising from 26.1 to 50.0 on seen tasks and from 24.2 to 53.6 on unseen tasks.
To further improve action representation quality, Wall-OSS-0.5 introduces a Vision-Aligned RVQ Action Tokenizer. Unlike rule-based action tokenization approaches, this tokenizer aligns discrete action tokens with visual and multimodal semantics, creating a more semantically grounded interface between robot actions and multimodal understanding.
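For readers unfamiliar with residual vector quantization (RVQ), the generic mechanism can be sketched as follows. This shows only plain RVQ encoding and decoding with fixed codebooks; the vision-alignment objective that distinguishes the Wall-OSS-0.5 tokenizer is not shown, and the function names and codebook shapes are illustrative assumptions.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize vector x in stages. Each codebook has shape (K, D).

    Every stage picks the code nearest to the residual left by the
    previous stages, so later tokens refine earlier ones.
    """
    residual = np.asarray(x, dtype=float).copy()
    tokens = []
    for codebook in codebooks:
        codebook = np.asarray(codebook, dtype=float)
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        tokens.append(index)
        residual = residual - codebook[index]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the selected code from each stage."""
    return sum(np.asarray(cb, dtype=float)[i] for cb, i in zip(codebooks, tokens))
```

Each action chunk would map to one token per codebook stage, which is the kind of short discrete sequence a language-model backbone can predict with cross-entropy.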
The model also introduces Action-Space Supervision for flow matching. Instead of supervising only velocity-field predictions, the loss is defined directly on the action trajectory recovered from those predictions, which improves convergence speed and stabilizes continuous action generation during training.
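The release does not spell out the exact loss, so the following is only one plausible reading, sketched under stated assumptions: a linear interpolation path between noise and action, and a one-step recovery of the action from the predicted velocity. All function names are hypothetical.

```python
import numpy as np


def interpolate(noise, action, t):
    """Point on the assumed straight-line path: noise at t=0, action at t=1."""
    return (1.0 - t) * noise + t * action


def velocity_loss(v_pred, noise, action):
    """Conventional flow-matching loss on the velocity field."""
    return float(np.mean((v_pred - (action - noise)) ** 2))


def recover_action(x_t, v_pred, t):
    """Extrapolate from x_t along the predicted velocity to t=1."""
    return x_t + (1.0 - t) * v_pred


def action_space_loss(v_pred, noise, action, t):
    """Loss measured on the recovered action rather than on the velocity."""
    x_t = interpolate(noise, action, t)
    action_hat = recover_action(x_t, v_pred, t)
    return float(np.mean((action_hat - action) ** 2))
```

Under these assumptions the recovered-action error equals (1 - t) times the velocity error, so for a single time t this loss is the velocity loss scaled by (1 - t)^2. The formulation in the paper may differ.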
For large-scale training, Wall-OSS-0.5 implements DMuon, a distributed Muon optimizer for VLA co-training. DMuon partitions matrix-level Newton-Schulz computation across sharded parameters and reduces end-to-end optimizer overhead by up to 100x relative to a naive implementation.
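For context, the per-matrix Newton-Schulz step that Muon-style optimizers apply to a gradient or momentum matrix can be sketched as below. The quintic coefficients are those used in the publicly available Muon reference implementation; whether DMuon uses the same values is an assumption, and the sharding and communication logic that DMuon adds is not shown.

```python
import numpy as np


def newton_schulz_orthogonalize(grad, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix with a quintic Newton-Schulz iteration.

    The result has singular values pushed toward 1 (not exactly 1),
    with the same shape as the input.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon reference
    x = np.asarray(grad, dtype=float)
    x = x / (np.linalg.norm(x) + eps)  # Frobenius norm bounds the spectral norm by 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so the Gram matrix is smaller
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * (gram @ gram)) @ x
    return x.T if transposed else x
```

Because each iteration needs the whole matrix for its matrix products, running this on parameters that are sharded across devices is where a naive implementation incurs overhead, which is the cost the release says DMuon reduces.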
Beyond manipulation performance, Wall-OSS-0.5 preserves broad vision-language capability while strengthening embodied grounding. Multimodal evaluation shows stable overall performance, with embodied grounding improving by 21.8 points.
On a 15-task real-robot fine-tuning suite, Wall-OSS-0.5 achieves 60.5 average task progress, outperforming π0.5 by 17.5 percentage points and widening the margin to 26 percentage points on the 10-task manipulation subset.
With this release, X Square Robot is making the full Wall-OSS-0.5 stack available to the community, including model weights, training code, training recipes, and optimizer implementations. The goal is to provide a reproducible foundation for researchers and developers working on embodied intelligence.
Wall-OSS-0.5 offers a practical path from large-scale VLA pretraining to robot behavior that can be tested directly in real-world settings. X Square Robot hopes the release will support further research, adaptation, and development toward general-purpose embodied AI.
More Details
- GitHub: https://github.com/X-Square-Robot/wall-x
- Hugging Face: https://huggingface.co/x-square-robot/wall-oss-0.5
- Project Page: https://x2robot.com/oss#resources
- Paper: https://x2robot.com/api/files/file/wall_oss_05.pdf
