Acquiring a multi-task imitation policy for 3D manipulation poses challenges in both scene understanding and action prediction. Current methods employ both 3D representations and multi-view 2D representations to predict the pose of the robot’s end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization to unseen tasks and inefficient execution over long horizons.
In this paper, we propose SAM-E, a novel architecture for robot manipulation that leverages a vision foundation model for generalizable scene understanding and sequence imitation for long-horizon action reasoning. Specifically, we adopt Segment Anything (SAM), pre-trained on a huge number of images and promptable masks, as the foundation model for extracting task-relevant features, and apply parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios.
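As a rough illustration of this fine-tuning step, the sketch below wraps the attention projections of a frozen ViT-style image encoder with LoRA adapters, so that only the low-rank matrices are trained on robot data. The class and attribute names (`LoRALinear`, `q_proj`, `v_proj`, `rank`, `alpha`) are illustrative assumptions, not the released SAM-E code.

```python
# Minimal LoRA sketch: freeze the pre-trained encoder, train only low-rank adapters.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W + (alpha/r) * B A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)        # start as an identity wrapper of the base layer
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


def inject_lora(encoder: nn.Module, rank: int = 8) -> nn.Module:
    """Replace q/v attention projections with LoRA-wrapped layers (attribute names vary by implementation)."""
    for module in encoder.modules():
        for name in ("q_proj", "v_proj"):
            if hasattr(module, name) and isinstance(getattr(module, name), nn.Linear):
                setattr(module, name, LoRALinear(getattr(module, name), rank))
    return encoder
```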
To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
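To make the single-pass idea concrete, the following is a minimal sketch of decoding a multi-channel translation heatmap, one channel per future timestep, into a sequence of 2D keypoints with a spatial soft-argmax. The shapes and decoding scheme here are assumptions for illustration; the actual action head also predicts rotation and gripper state.

```python
# Hedged sketch: one heatmap channel per future timestep, decoded in a single pass.
import torch


def decode_action_sequence(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (B, T, H, W). Returns (B, T, 2) expected pixel coordinates per timestep."""
    b, t, h, w = heatmaps.shape
    probs = heatmaps.flatten(2).softmax(dim=-1).view(b, t, h, w)
    ys = torch.linspace(0, 1, h, device=heatmaps.device)
    xs = torch.linspace(0, 1, w, device=heatmaps.device)
    exp_y = (probs.sum(dim=3) * ys).sum(dim=2)   # (B, T) expected row
    exp_x = (probs.sum(dim=2) * xs).sum(dim=2)   # (B, T) expected column
    return torch.stack([exp_x, exp_y], dim=-1)   # per-timestep 2D keypoints


# One forward pass yields the whole T-step sequence, so the policy is queried
# once for several execution steps instead of once per step.
seq = decode_action_sequence(torch.randn(1, 5, 64, 64))  # -> shape (1, 5, 2)
```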
Overview of SAM-E. (i) The SAM encoder provides promptable visual embeddings of single-view observations after fine-tuning on embodied scenarios with parameter-efficient LoRA. (ii) A multi-view transformer performs cross-view information integration and vision-language alignment. (iii) A coherent action sequence is predicted with temporal imitation for efficient multi-step execution.
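The cross-view and vision-language integration in step (ii) can be sketched, under our own simplifying assumptions, as a transformer encoder over the concatenation of per-view visual tokens and language tokens; the layer sizes and token layout below are illustrative, not the paper's exact design.

```python
# Rough sketch of cross-view / vision-language fusion with joint self-attention.
import torch
import torch.nn as nn


class MultiViewFusion(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, view_tokens: torch.Tensor, lang_tokens: torch.Tensor) -> torch.Tensor:
        """view_tokens: (B, V*N, D) patch tokens from V camera views;
        lang_tokens: (B, L, D) instruction embeddings.
        Joint attention lets every view token attend to the other views and the instruction."""
        tokens = torch.cat([view_tokens, lang_tokens], dim=1)
        fused = self.encoder(tokens)
        return fused[:, : view_tokens.shape[1]]   # keep the visual tokens for the action head
```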
We evaluate SAM-E on RLBench, a challenging multi-task 3D manipulation benchmark. For a fair comparison with baselines, we follow the same setting as the state-of-the-art method RVT, using 18 tasks with 249 variations in our experiments. Moreover, we evaluate the generalization ability of SAM-E via few-shot adaptation to 6 new tasks.
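For reference, a minimal rollout on an RLBench task might look like the sketch below. The RLBench calls follow its public API, but constructor details may differ across versions, and the dummy action (holding the current gripper pose) stands in for a trained SAM-E policy.

```python
# Hedged sketch of an evaluation rollout on one RLBench task.
import numpy as np
from rlbench.environment import Environment
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.arm_action_modes import EndEffectorPoseViaPlanning
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.observation_config import ObservationConfig
from rlbench.tasks import OpenDrawer

obs_config = ObservationConfig()
obs_config.set_all(True)  # enable RGB-D observations from all cameras

env = Environment(
    action_mode=MoveArmThenGripper(EndEffectorPoseViaPlanning(), Discrete()),
    obs_config=obs_config,
    headless=True,
)
env.launch()

task = env.get_task(OpenDrawer)      # one of the 18 evaluation tasks
descriptions, obs = task.reset()     # language instructions + first observation

for _ in range(25):                  # bounded episode length
    # Placeholder action: hold the current end-effector pose with gripper open.
    # A trained SAM-E policy would instead predict the next keyframe pose(s).
    action = np.concatenate([obs.gripper_pose, [1.0]])
    obs, reward, terminate = task.step(action)
    if terminate:
        break

env.shutdown()
```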
Success: put the orange into the drawer
Success: put the banana in the basket
Success: put the item in the top drawer
Success: put the item in the middle drawer
Success: put the item in the bottom drawer
Failure: put the item in the bottom drawer
Success: sweep dirt to the short dustpan
Success: sweep dirt to the short dustpan
Success: sweep dirt to the short dustpan
Success: sweep dirt to the tall dustpan
Success: take the steak off the grill
Success: take the steak off the grill
Success: take the steak off the grill
Success: take the chicken off the grill
Success: open the top drawer
Success: open the middle drawer
Success: open the top drawer
Failure: open the bottom drawer
Success: turn right tap
Success: turn left tap
Success: turn left tap
Success: turn right tap
Success: close the red jar
Success: close the cyan jar
Success: close the azure jar
Failure: close the navy jar
Success: use the stick to drag the cube onto the navy target
Success: use the stick to drag the cube onto the red target
Success: use the stick to drag the cube onto the violet target
Success: use the stick to drag the cube onto the red target
Success: stack 3 yellow blocks
Success: stack 2 red blocks
Failure: stack 3 teal blocks
Failure: stack 4 navy blocks
Success: screw in the blue light bulb
Success: screw in the rose light bulb
Success: screw in the blue light bulb
Failure: screw in the white light bulb
Success: slide the block to blue target
Success: slide the block to green target
Success: slide the block to pink target
Failure: slide the block to pink target
Success: put the money away in the safe on the top shelf
Success: put the money away in the safe on the middle shelf
Success: put the money away in the safe on the bottom shelf
Success: put the money away in the safe on the middle shelf
Success: stack the wine bottle to the left of the rack
Success: stack the wine bottle to the middle of the rack
Success: stack the wine bottle to the right of the rack
Success: stack the wine bottle to the middle of the rack
Success: put the coffee in the cupboard
Success: put the coffee in the cupboard
Success: put the mustard jello in the cupboard
Failure: put the crackers in the cupboard
Success: put the star in the shape sorter
Success: put the cylinder in the shape sorter
Success: put the cylinder in the shape sorter
Failure: put the triangular prism in the shape sorter
Success: push the maroon button, then push the green button, then push the navy button
Success: push the maroon button
Success: push the maroon button, then push the green button
Success: push the maroon button, then push the blue button
Success: put the ring on the black spoke
Success: put the ring on the purple spoke
Success: put the ring on the azure spoke
Failure: put the ring on the orange spoke
Failure: stack the other cups on top of the maroon cup
Failure: stack the other cups on top of the maroon cup
Failure: stack the other cups on top of the violet cup
Failure: place 3 cups on the cup holder
Failure: place 2 cups on the cup holder
Failure: place 2 cups on the cup holder
Success: put the steak on the grill
Success: put the steak on the grill
Success: put the chicken on the grill
Failure: put the steak on the grill
Success: open the olive jar
Success: open the red jar
Success: open the azure jar
Failure: open the silver jar
Success: screw the nail in to the block
Success: screw the nail in to the block
Success: screw the nail in to the block
Failure: screw the nail in to the block
Success: solve the puzzle
Success: solve the puzzle
Success: solve the puzzle
Failure: solve the puzzle
Success: toilet seat down
Success: toilet seat down
Success: toilet seat down
Success: toilet seat down
Success: turn on the TV
Success: turn on the TV
Success: turn on the TV
Failure: turn on the TV
@inproceedings{2024sam,
author = {Junjie Zhang and Chenjia Bai and Haoran He and Zhigang Wang and Bin Zhao and Xiu Li and Xuelong Li},
title = {SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation},
booktitle = {International Conference on Machine Learning},
year = {2024},
}