Twisting Lids Off with Two Hands

TL;DR: We train two robot hands to twist bottle lids using deep RL and sim-to-real.

Overview

Manipulating objects with two multi-fingered hands has been a long-standing challenge in robotics, attributed to the contact-rich nature of many manipulation tasks and the complexity inherent in coordinating a high-dimensional bimanual system. In this work, we consider the problem of twisting lids of various bottle-like objects with two hands, and demonstrate that policies trained in simulation using deep reinforcement learning can be effectively transferred to the real world. With novel engineering insights into physical modeling, real-time perception, and reward design, the policy demonstrates generalization capabilities across a diverse set of unseen objects, showcasing dynamic and dexterous behaviors. Our findings serve as compelling evidence that deep reinforcement learning combined with sim-to-real transfer remains a promising approach for addressing manipulation problems of unprecedented complexity.

Robustness

During policy deployment, we perturb objects at random times by poking or pushing them along random directions using a picker tool or a hand. Our policy is robust against these random external forces, and can adapt quickly to sustain continuous manipulation. Both videos show how our policy can reorient and translate a perturbed object back to a stable in-hand pose.

Emergent Behavior

Our policy exhibits interesting emergent behaviors that maintain its robustness when deployed on objects that are significantly different from those in the training distribution. We observe that our policy can skillfully adjust the finger gaits and grasps of both hands to recover objects from unstable states back to stable poses. Our policy also adapts its movements to objects of different shapes and sizes.

Perception

[Videos: RGB, segmentation masks, depth]

We use 3D object keypoints extracted from RGB-D images as the object representation. Specifically, we generate two separate segmentation masks, one for the bottle body and one for the lid, on the first frame and track them through all remaining frames. To approximate the 3D center-of-mass coordinates of the bottle body and lid, we compute the center of each mask in the image plane, then use noisy readings from a depth camera to recover the corresponding 3D positions. Interestingly, the segmentation mask tracker we use prioritizes spatial continuity over semantic meaning. As a result, when the lid is twisted off, the tracker follows the 3D position of the bottle's thread instead of the dropped lid, which makes our policy more robust (bottom video).
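The keypoint step described above can be illustrated with a minimal sketch. It is not our released code: the function name `mask_centroid_to_3d`, the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, and the choice to take the median of valid depth values inside the mask (one simple way to cope with noisy depth) are all assumptions made for illustration.

```python
import numpy as np


def mask_centroid_to_3d(mask, depth, fx, fy, cx, cy):
    """Back-project the centroid of a binary mask to a 3D point (camera frame).

    mask:  (H, W) boolean array, True where the object is segmented.
    depth: (H, W) float array of depth in meters, aligned to the mask.
    fx, fy, cx, cy: pinhole camera intrinsics (assumed known from calibration).
    Returns a length-3 array [x, y, z], or None if no usable pixel exists.
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None  # object not visible in this frame

    # Mask center in the image plane (pixel coordinates).
    u = cols.mean()
    v = rows.mean()

    # Assumption: use the median of valid depth readings inside the mask
    # so that isolated holes or spikes in the depth image are ignored.
    values = depth[rows, cols]
    values = values[np.isfinite(values) & (values > 0)]
    if values.size == 0:
        return None
    z = float(np.median(values))

    # Standard pinhole back-projection.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

Calling this once per frame for the body mask and once for the lid mask would yield the two 3D keypoints; any smoothing or filtering over time is left out of the sketch.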

Acknowledgements

We thank Chen Wang and Yuzhe Qin for helpful discussions on hardware setup and simulation of the Allegro Hand. TL is supported by fellowships from the National Science Foundation and UC Berkeley. ZY is supported by funding from InnoHK Centre for Logistics Robotics and ONR MURI N00014-22-1-2773. HQ is supported by the DARPA Machine Common Sense program and ONR MURI N00014-21-1-2801.