RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation

Accepted by International Conference on Intelligent Robots and Systems (IROS) 2025!

The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

Abstract

This paper introduces RoboDexVLM, a framework for robot task planning and grasp detection tailored for a collaborative manipulator equipped with a dexterous hand. Previous methods focus on simplified and limited manipulation tasks and often neglect the complexities of grasping a diverse array of objects over long horizons. In contrast, the proposed framework uses a dexterous hand capable of grasping objects of varying shapes and sizes while executing tasks specified by natural language commands. The approach has two core components. First, a robust task planner with a task-level recovery mechanism leverages vision-language models (VLMs), enabling the system to interpret and execute open-vocabulary commands for long-sequence tasks. Second, a language-guided dexterous grasp perception algorithm, based on robot kinematics and formal methods, is tailored for zero-shot dexterous manipulation with diverse objects and commands. Comprehensive experimental results validate the effectiveness, adaptability, and robustness of RoboDexVLM in handling long-horizon scenarios and performing dexterous grasping. These results highlight the framework's ability to operate in complex environments and its potential for open-vocabulary dexterous manipulation.

Video Abstract of RoboDexVLM

Static Demonstration

Key frames of the task "Put the fruits into the basket".

Key frames of the task "Open the drawer and pick out the objects inside".

Key frames of the task "Put the orange into the basket" with recovery from disturbance.

Video Demonstrations

Note: The following videos are played at 1x speed, with no speed-up.

Open-Vocabulary Dexterous Manipulation

Human: "Put the red apple into the hat"

Human: "Put the green apple into the hat"

Human: "Put the middle carambola into the box"

Human: "Put the right carambola into the hat"

Human: "Put the apple into the hat"

Human: "Put the smaller carambola into the hat"

Long-Horizon Dexterous Manipulation

Human: "Place the peach on the tabletop from the drawer"

Human: "Put all the fruits into the box"

Manipulation Recovery from Failure

Human: "Put the carambola into the basket" (failure without recovery)

Human: "Put the orange into the hat" (success with recovery)

BibTeX

@article{liu2025robodexvlm,
  title={RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation},
  author={Liu, Haichao and Guo, Sikai and Mai, Pengfei and Cao, Jiahang and Li, Haoang and Ma, Jun},
  journal={arXiv preprint arXiv:2503.01616},
  year={2025}
}
