TLDR
We present a successful approach to learning humanoid dexterous manipulation using sim-to-real reinforcement learning, achieving robust generalization and high performance without the need for human demonstrations.
Learning generalizable robot manipulation policies, especially for complex multi-fingered humanoids, remains a significant challenge. Existing approaches primarily rely on extensive data collection and imitation learning, which are expensive, labor-intensive, and difficult to scale. Sim-to-real reinforcement learning (RL) offers a promising alternative, but has mostly succeeded in simpler state-based or single-hand setups. How to effectively extend this to vision-based, contact-rich bimanual manipulation tasks remains an open question. In this paper, we introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three challenging dexterous manipulation tasks: grasp-and-reach, box lift, and bimanual handover. Our method features an automated real-to-sim tuning module, a generalized reward formulation based on contact and object goals, a divide-and-conquer policy distillation framework, and a hybrid object representation strategy with modality-specific augmentation. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors, highlighting that vision-based dexterous manipulation via sim-to-real RL is not only viable but also scalable and broadly applicable to real-world humanoid manipulation tasks.
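To make the contact-and-object-goal reward idea concrete, below is a minimal sketch of how such a generalized reward could be computed. The function names, weights, and distance shaping here are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Minimal sketch of a contact-and-object-goal reward (illustrative only;
# names, weights, and shaping are assumptions, not the paper's exact terms).
def manipulation_reward(fingertip_pos, contact_goals, object_pos, object_goal,
                        w_contact=1.0, w_object=2.0):
    """fingertip_pos, contact_goals: (num_fingertips, 3);
    object_pos, object_goal: (3,)."""
    # Contact-goal term: pull each fingertip toward its assigned contact
    # point on the object; each term is bounded in (0, 1].
    fingertip_dist = np.linalg.norm(fingertip_pos - contact_goals, axis=-1)
    r_contact = float(np.mean(1.0 / (1.0 + fingertip_dist)))

    # Object-goal term: pull the object toward its target position
    # (e.g., a reach target, a lift height, or a handover point).
    r_object = 1.0 / (1.0 + np.linalg.norm(object_pos - object_goal))

    return w_contact * r_contact + w_object * r_object
```

Under this reading, the same two terms could cover grasp-and-reach, lift, and handover: only the contact and object goals change per task, not the reward code.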
We train a humanoid robot with two multi-fingered hands to perform a range of contact-rich dexterous manipulation tasks on various objects. Observations are obtained from a third-person-view camera, an egocentric camera, and robot proprioception. The deployed reinforcement learning policies can adapt to a variety of unseen real-world objects with varying physical properties (e.g., shape, size, color, material, mass) and remain robust against force disturbances.
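As a concrete illustration of this observation setup, the policy input can be bundled as below. The stream names and shapes are our assumptions for illustration; the paper's exact interface may differ.

```python
import numpy as np

# Hypothetical observation bundle for the deployed policy; names and shapes
# are illustrative assumptions based on the setup described above.
def build_observation(third_person_rgb, egocentric_rgb, joint_pos, joint_vel):
    return {
        # Third-person-view camera: global view of the robot and workspace.
        "third_person_rgb": np.asarray(third_person_rgb, dtype=np.uint8),  # (H, W, 3)
        # Egocentric camera: close-up view of the hands and object.
        "egocentric_rgb": np.asarray(egocentric_rgb, dtype=np.uint8),      # (H, W, 3)
        # Proprioception: joint positions and velocities of the arms and
        # multi-fingered hands.
        "proprio": np.concatenate([joint_pos, joint_vel]).astype(np.float32),
    }
```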
Our contributions include an automated real-to-sim tuning module, a generalized reward design scheme, a divide-and-conquer distillation process, and a mixture of sparse and dense object representations. These techniques collectively enable the training of robust, generalizable, and dexterous manipulation policies that can be successfully transferred to real-world humanoid robots.
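As one way to picture the divide-and-conquer distillation step, the sketch below trains a single vision-based student to imitate several state-based specialist teachers, DAgger-style. The environment and policy interfaces are hypothetical placeholders, not the released code.

```python
import torch
import torch.nn.functional as F

# Sketch of divide-and-conquer distillation: specialist teachers, trained
# with state-based RL on subsets of tasks/objects, supervise one generalist
# vision-based student. All interfaces here are hypothetical placeholders.
def distill(teachers, student, env, optimizer, num_steps=100_000):
    for _ in range(num_steps):
        state = env.get_privileged_state()   # simulator state (teachers only)
        vision_obs = env.get_vision_obs()    # images + proprioception (student)

        # Pick the specialist responsible for the current task/object subset.
        with torch.no_grad():
            target_action = teachers[env.current_task_id](state)

        # The student regresses onto the teacher's action from deployable
        # observations only.
        pred_action = student(vision_obs)
        loss = F.mse_loss(pred_action, target_action)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Roll out the student's own action so it learns on states it will
        # actually visit at deployment time (DAgger-style).
        env.step(pred_action.detach())
```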
Our policy is capable of dexterous grasping on a diverse range of objects, including ones that lie outside the training distribution. The emergent dexterity also enables our policy to solve hard grasping tasks that require precise finger motions, such as grasping small and slippery objects.
We observe diverse grasp patterns emerging from the same policy, even for the same object. The grasp patterns adapt to variations in both object properties and object states.
During policy deployment, we perturb objects at random times by poking, pulling, and pushing them along random directions using a picker tool or by hand. Our policy is robust against these random external forces and adapts quickly to sustain continuous policy execution. Left video shows grasp policies; right video shows lift and handover policies.
Our policy also exhibits interesting emergent failure recovery behaviors that maintain its robustness even when the force disturbances are so strong that the object is dropped. We observe that our policy can quickly adjust the finger motions and perform regrasping to continue policy execution. Left video shows recovery of a grasp policy; right video shows recovery of a lift policy.
During training, we sometimes observe RL policies that develop remarkably dynamic and creative motions. The left video demonstrates a "standard" handover policy, while the right video showcases a highly dynamic variant. Although these fascinating behaviors often emerge from exploiting simulator dynamics and do not transfer well to the real world, we find them intriguing and want to share these entertaining examples with the community :)
@article{lin2025sim,
  author  = {Lin, Toru and Sachdev, Kartik and Fan, Linxi and Malik, Jitendra and Zhu, Yuke},
  title   = {Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids},
  journal = {arXiv preprint arXiv:2502.20396},
  year    = {2025}
}
We thank members of the NVIDIA GEAR lab for help with hardware infrastructure, in particular Zhenjia Xu, Yizhou Zhao, and Zu Wang. This work was partially conducted during TL's internship at NVIDIA. TL is supported by NVIDIA and a National Science Foundation fellowship.