Acquiring a multi-task imitation policy for 3D manipulation poses challenges in both scene understanding and action prediction. Current methods employ both 3D representations and multi-view 2D representations to predict the pose of the robot’s end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization to unseen tasks and inefficient execution over long horizons.
In this paper, we propose SAM-E, a novel architecture for robot manipulation that leverages a vision foundation model for generalizable scene understanding and sequence imitation for long-horizon action reasoning. Specifically, we adopt Segment Anything (SAM), pre-trained on a huge number of images and promptable masks, as the foundation model for extracting task-relevant features, and apply parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios.
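As a rough illustration of this fine-tuning step, the sketch below wraps the attention projections of a frozen ViT-style image encoder with LoRA adapters, so that only the low-rank matrices are trained on robot data. The class and attribute names (`LoRALinear`, `q_proj`, `v_proj`, `rank`, `alpha`) are illustrative assumptions, not the released SAM-E code.

```python
# Minimal LoRA sketch: freeze the pre-trained encoder, train only low-rank adapters.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W + (alpha/r) * B A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)        # start as an identity wrapper of the base layer
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


def inject_lora(encoder: nn.Module, rank: int = 8) -> nn.Module:
    """Replace q/v attention projections with LoRA-wrapped layers (attribute names vary by implementation)."""
    for module in encoder.modules():
        for name in ("q_proj", "v_proj"):
            if hasattr(module, name) and isinstance(getattr(module, name), nn.Linear):
                setattr(module, name, LoRALinear(getattr(module, name), rank))
    return encoder
```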
To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
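To make the single-pass idea concrete, the following is a minimal sketch of decoding a multi-channel translation heatmap, one channel per future timestep, into a sequence of 2D keypoints with a spatial soft-argmax. The shapes and decoding scheme here are assumptions for illustration; the actual action head also predicts rotation and gripper state.

```python
# Hedged sketch: one heatmap channel per future timestep, decoded in a single pass.
import torch


def decode_action_sequence(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (B, T, H, W). Returns (B, T, 2) expected pixel coordinates per timestep."""
    b, t, h, w = heatmaps.shape
    probs = heatmaps.flatten(2).softmax(dim=-1).view(b, t, h, w)
    ys = torch.linspace(0, 1, h, device=heatmaps.device)
    xs = torch.linspace(0, 1, w, device=heatmaps.device)
    exp_y = (probs.sum(dim=3) * ys).sum(dim=2)   # (B, T) expected row
    exp_x = (probs.sum(dim=2) * xs).sum(dim=2)   # (B, T) expected column
    return torch.stack([exp_x, exp_y], dim=-1)   # per-timestep 2D keypoints


# One forward pass yields the whole T-step sequence, so the policy is queried
# once for several execution steps instead of once per step.
seq = decode_action_sequence(torch.randn(1, 5, 64, 64))  # -> shape (1, 5, 2)
```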
Overview of SAM-E. (i) The SAM encoder provides promptable visual embeddings of single-view observations after fine-tuning on embodied scenarios with parameter-efficient LoRA. (ii) A multi-view transformer performs cross-view information integration and vision-language alignment. (iii) A coherent action sequence is predicted with temporal imitation for efficient multi-step execution.
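The cross-view and vision-language integration in step (ii) can be sketched, under our own simplifying assumptions, as a transformer encoder over the concatenation of per-view visual tokens and language tokens; the layer sizes and token layout below are illustrative, not the paper's exact design.

```python
# Rough sketch of cross-view / vision-language fusion with joint self-attention.
import torch
import torch.nn as nn


class MultiViewFusion(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, view_tokens: torch.Tensor, lang_tokens: torch.Tensor) -> torch.Tensor:
        """view_tokens: (B, V*N, D) patch tokens from V camera views;
        lang_tokens: (B, L, D) instruction embeddings.
        Joint attention lets every view token attend to the other views and the instruction."""
        tokens = torch.cat([view_tokens, lang_tokens], dim=1)
        fused = self.encoder(tokens)
        return fused[:, : view_tokens.shape[1]]   # keep the visual tokens for the action head
```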
We evaluate SAM-E on RLBench, a challenging multi-task 3D manipulation benchmark. For a fair comparison with baselines, we follow the same setting as the state-of-the-art method RVT, using 18 tasks with 249 variations in our experiments. Moreover, we evaluate the generalization ability of SAM-E via few-shot adaptation to 6 new tasks.
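For reference, a minimal rollout on an RLBench task might look like the sketch below. The RLBench calls follow its public API, but constructor details may differ across versions, and the dummy action (holding the current gripper pose) stands in for a trained SAM-E policy.

```python
# Hedged sketch of an evaluation rollout on one RLBench task.
import numpy as np
from rlbench.environment import Environment
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.arm_action_modes import EndEffectorPoseViaPlanning
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.observation_config import ObservationConfig
from rlbench.tasks import OpenDrawer

obs_config = ObservationConfig()
obs_config.set_all(True)  # enable RGB-D observations from all cameras

env = Environment(
    action_mode=MoveArmThenGripper(EndEffectorPoseViaPlanning(), Discrete()),
    obs_config=obs_config,
    headless=True,
)
env.launch()

task = env.get_task(OpenDrawer)      # one of the 18 evaluation tasks
descriptions, obs = task.reset()     # language instructions + first observation

for _ in range(25):                  # bounded episode length
    # Placeholder action: hold the current end-effector pose with gripper open.
    # A trained SAM-E policy would instead predict the next keyframe pose(s).
    action = np.concatenate([obs.gripper_pose, [1.0]])
    obs, reward, terminate = task.step(action)
    if terminate:
        break

env.shutdown()
```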
Success: put the orange into the drawer
Success: put the banana in the basket
Success: put the item in the top drawer
Success: put the item in the middle drawer
Success: put the item in the bottom drawer
Failure: put the item in the bottom drawer
Success: sweep dirt to the short dustpan
Success: sweep dirt to the short dustpan
Success: sweep dirt to the short dustpan
Success: sweep dirt to the tall dustpan
Success: take the steak off the grill
Success: take the steak off the grill
Success: take the steak off the grill
Success: take the chicken off the grill
Success: open the top drawer
Success: open the middle drawer
Success: open the top drawer
Failure: open the bottom drawer
Success: turn right tap
Success: turn left tap
Success: turn left tap
Success: turn right tap
Success: close the red jar
Success: close the cyan jar
Success: close the azure jar
Failure: close the navy jar
Success: use the stick to drag the cube onto the navy target
Success: use the stick to drag the cube onto the red target
Success: use the stick to drag the cube onto the violet target
Success: use the stick to drag the cube onto the red target
Success: stack 3 yellow blocks
Success: stack 2 red blocks
Failure: stack 3 teal blocks
Failure: stack 4 navy blocks
Success: screw in the blue light bulb
Success: screw in the rose light bulb
Success: screw in the blue light bulb
Failure: screw in the white light bulb
Success: slide the block to blue target
Success: slide the block to green target
Success: slide the block to pink target
Failure: slide the block to pink target
Success: put the money away in the safe on the top shelf
Success: put the money away in the safe on the middle shelf
Success: put the money away in the safe on the bottom shelf
Success: put the money away in the safe on the middle shelf
Success: stack the wine bottle to the left of the rack
Success: stack the wine bottle to the middle of the rack
Success: stack the wine bottle to the right of the rack
Success: stack the wine bottle to the middle of the rack
Success: put the coffee in the cupboard
Success: put the coffee in the cupboard
Success: put the mustard jello in the cupboard
Failure: put the crackers in the cupboard
Success: put the star in the shape sorter
Success: put the cylinder in the shape sorter
Success: put the cylinder in the shape sorter
Failure: put the triangular prism in the shape sorter
Success: push the maroon button, then push the green button, then push the navy button
Success: push the maroon button
Success: push the maroon button, then push the green button
Success: push the maroon button, then push the blue button
Success: put the ring on the black spoke
Success: put the ring on the purple spoke
Success: put the ring on the azure spoke
Failure: put the ring on the orange spoke
Failure: stack the other cups on top of the maroon cup
Failure: stack the other cups on top of the maroon cup
Failure: stack the other cups on top of the violet cup
Failure: place 3 cups on the cup holder
Failure: place 2 cups on the cup holder
Failure: place 2 cups on the cup holder
Success: put the steak on the grill
Success: put the steak on the grill
Success: put the chicken on the grill
Failure: put the steak on the grill
Success: open the olive jar
Success: open the red jar
Success: open the azure jar
Failure: open the silver jar
Success: screw the nail in to the block
Success: screw the nail in to the block
Success: screw the nail in to the block
Failure: screw the nail in to the block
Success: solve the puzzle
Success: solve the puzzle
Success: solve the puzzle
Failure: solve the puzzle
Success: toilet seat down
Success: toilet seat down
Success: toilet seat down
Success: toilet seat down
Success: turn on the TV
Success: turn on the TV
Success: turn on the TV
Failure: turn on the TV
@inproceedings{2024sam,
author = {Junjie Zhang and Chenjia Bai and Haoran He and Zhigang Wang and Bin Zhao and Xiu Li and Xuelong Li},
title = {SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation},
booktitle = {International Conference on Machine Learning},
year = {2024},
}