When manipulating objects in the real world, we need reactive feedback policies that take sensor information into account to inform decisions. This study examines how different encoders can be used in a reinforcement learning (RL) framework to interpret the spatial environment in the local surroundings of a robot arm. Our investigation focuses on comparing real-world vision with 3D scene inputs, exploring new architectures in the process.
We build on the SERL framework, which provides a sample-efficient and stable RL foundation while keeping training times short. The results of this study indicate that spatial input significantly outperforms its visual counterpart, evaluated on a box-picking task with a vacuum gripper.
Reliable object manipulation with vacuum grippers is challenging due to variations in object size, shape, and surface texture. Traditional visual feedback struggles to generalize, making reinforcement learning with spatial representations a promising approach.
We employ a UR5 robotic arm with a Robotiq EPick vacuum gripper, equipped with two Intel RealSense depth cameras. The system captures spatial information as voxel grids, depth images, and RGB images, which are processed by different encoder architectures.
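As a rough illustration of this preprocessing step, the sketch below back-projects a depth image into a binary occupancy voxel grid. The function name `depth_to_voxels`, the pinhole-intrinsics interface, and the workspace crop are illustrative assumptions, not the exact pipeline used in our system.

```python
import numpy as np

def depth_to_voxels(depth, fx, fy, cx, cy,
                    bounds=((-0.15, 0.15), (-0.15, 0.15), (0.0, 0.3)),
                    resolution=32):
    """Back-project a depth image into a binary occupancy voxel grid.

    depth          : (H, W) array of depths in meters (0 = invalid pixel)
    fx, fy, cx, cy : pinhole camera intrinsics
    bounds         : (min, max) of the workspace crop along x, y, z in meters
    resolution     : number of voxels per axis
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0

    # Back-project valid pixels into camera-frame 3D points.
    z = depth[valid]
    x = (us[valid] - cx) * z / fx
    y = (vs[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)

    # Keep only points inside the workspace crop and discretize them.
    mins = np.array([b[0] for b in bounds])
    maxs = np.array([b[1] for b in bounds])
    inside = np.all((points >= mins) & (points < maxs), axis=-1)
    idx = ((points[inside] - mins) / (maxs - mins) * resolution).astype(int)

    grid = np.zeros((resolution,) * 3, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```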
We build on the Sample-Efficient Robotic Reinforcement Learning (SERL) framework, comparing different perception strategies: visual encoders (ResNet), depth maps, and voxel grids (3D convolutions). Training is conducted in a real-world environment with diverse box shapes and conditions. Actor and learner nodes run concurrently and communicate via AgentLace.
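To make the voxel-grid branch concrete, here is a minimal VoxNet-style 3D-convolutional encoder. The PyTorch implementation, layer sizes, and the class name `VoxelEncoder` are assumptions for illustration, not the exact architecture trained in this work.

```python
import torch
import torch.nn as nn

class VoxelEncoder(nn.Module):
    """VoxNet-style 3D CNN mapping an occupancy grid to a feature vector."""

    def __init__(self, grid_size=32, feature_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2),  # coarse spatial features
            nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
        )
        # Infer the flattened size once with a dummy input.
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, 1, *(grid_size,) * 3)).numel()
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(n_flat, feature_dim), nn.ReLU())

    def forward(self, voxels):
        # voxels: (batch, grid_size, grid_size, grid_size) occupancy in {0, 1}
        return self.head(self.conv(voxels.unsqueeze(1)))

# Example: encode a batch of 32^3 voxel grids into 128-d features for the policy.
features = VoxelEncoder()(torch.zeros(4, 32, 32, 32))
print(features.shape)  # torch.Size([4, 128])
```

Inferring the flattened feature size from a dummy forward pass keeps the linear head consistent if the grid resolution is changed.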
We evaluated our approach on a real-world vacuum gripping task using a variety of objects, both seen and unseen during training. Policies trained with voxel-based encodings were compared against traditional image-based reinforcement learning methods.
Voxel-based policies demonstrated superior performance, achieving higher success rates and faster execution times than image-based approaches. The use of pre-trained VoxNet models and observation-space symmetries further enhanced policy robustness and generalization to unseen objects. Below are evaluation videos showcasing different gripping scenarios:
I continued this work with a multi-robot setup where a box is handed over between two UR5 arms using suction grippers. Each arm is equipped with a wrist-mounted D405 camera that provides voxelized, localized point cloud observations to the RL pipeline. Episodes begin after a scripted box pickup handled during the environment reset. Both robots are jointly controlled by a single RL policy operating in a 14-dimensional action space. Safety mechanisms automatically detect and handle potential collisions when the arms move too close together or when excessive forces are measured during the handover.
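The sketch below illustrates the kind of safety check described above, assuming access to both end-effector positions in a shared frame and wrist force/torque readings; the thresholds and the function name `is_unsafe` are hypothetical.

```python
import numpy as np

# Hypothetical thresholds; the actual values depend on the hardware setup.
MIN_EE_DISTANCE = 0.08   # meters between the two end effectors
MAX_FORCE = 40.0         # Newtons measured at either wrist

def is_unsafe(ee_pos_a, ee_pos_b, wrench_a, wrench_b):
    """Return True if the dual-arm episode should be aborted.

    ee_pos_a, ee_pos_b : (3,) end-effector positions in a shared world frame
    wrench_a, wrench_b : (6,) force/torque readings [Fx, Fy, Fz, Tx, Ty, Tz]
    """
    too_close = np.linalg.norm(np.asarray(ee_pos_a) - np.asarray(ee_pos_b)) < MIN_EE_DISTANCE
    too_forceful = max(np.linalg.norm(wrench_a[:3]), np.linalg.norm(wrench_b[:3])) > MAX_FORCE
    return too_close or too_forceful
```

A check like this would typically run at every control step and, when triggered, end the episode before continuing training.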
The policy was trained on the green box in the vertical handover configuration. The horizontal handover and the red box scenarios were never seen during training.
Due to time limitations, I have not yet conducted any experiments on the multi-robot handover setup.
@article{sutter2025comparison,
  title={A comparison of visual representations for real-world reinforcement learning in the context of vacuum gripping},
  author={Sutter, Nico and Hartmann, Valentin N and Coros, Stelian},
  journal={arXiv preprint arXiv:2503.02405},
  year={2025}
}