When manipulating objects in the real world, we need reactive feedback policies that take sensor information into account to inform decisions. This study examines how different encoders can be used in a reinforcement learning (RL) framework to interpret the spatial environment in the local surroundings of a robot arm. Our investigation focuses on comparing real-world vision with 3D scene inputs, exploring new architectures in the process.
We build on the SERL framework, which provides a sample-efficient and stable RL foundation while keeping training times short. The results of this study indicate that spatial input significantly outperforms its visual counterpart, evaluated on a box-picking task with a vacuum gripper.
Reliable object manipulation with vacuum grippers is challenging due to variations in object size, shape, and surface texture. Traditional visual feedback struggles to generalize, making reinforcement learning with spatial representations a promising approach.
We employ a UR5 robotic arm with a Robotiq EPick vacuum gripper, equipped with two Intel RealSense depth cameras. The system captures spatial information as voxel grids, depth images, and RGB images, which are processed by different encoder architectures.
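As a rough illustration of this preprocessing step, the sketch below back-projects a depth image into a binary occupancy voxel grid. The function name `depth_to_voxels`, the pinhole-intrinsics interface, and the workspace crop are illustrative assumptions, not the exact pipeline used in our system.

```python
import numpy as np

def depth_to_voxels(depth, fx, fy, cx, cy,
                    bounds=((-0.15, 0.15), (-0.15, 0.15), (0.0, 0.3)),
                    resolution=32):
    """Back-project a depth image into a binary occupancy voxel grid.

    depth          : (H, W) array of depths in meters (0 = invalid pixel)
    fx, fy, cx, cy : pinhole camera intrinsics
    bounds         : (min, max) of the workspace crop along x, y, z in meters
    resolution     : number of voxels per axis
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0

    # Back-project valid pixels into camera-frame 3D points.
    z = depth[valid]
    x = (us[valid] - cx) * z / fx
    y = (vs[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)

    # Keep only points inside the workspace crop and discretize them.
    mins = np.array([b[0] for b in bounds])
    maxs = np.array([b[1] for b in bounds])
    inside = np.all((points >= mins) & (points < maxs), axis=-1)
    idx = ((points[inside] - mins) / (maxs - mins) * resolution).astype(int)

    grid = np.zeros((resolution,) * 3, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```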
We build on the Sample-Efficient Robotic Reinforcement Learning (SERL) framework, comparing different perception strategies: visual encoders (ResNet), depth maps, and voxel grids (3D convolutions). Training is conducted in a real-world environment with diverse box shapes and conditions. Actor and learner nodes run concurrently and communicate via AgentLace.
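To make the voxel-grid branch concrete, here is a minimal VoxNet-style 3D-convolutional encoder. The PyTorch implementation, layer sizes, and the class name `VoxelEncoder` are assumptions for illustration, not the exact architecture trained in this work.

```python
import torch
import torch.nn as nn

class VoxelEncoder(nn.Module):
    """VoxNet-style 3D CNN mapping an occupancy grid to a feature vector."""

    def __init__(self, grid_size=32, feature_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2),  # coarse spatial features
            nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
        )
        # Infer the flattened size once with a dummy input.
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, 1, *(grid_size,) * 3)).numel()
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(n_flat, feature_dim), nn.ReLU())

    def forward(self, voxels):
        # voxels: (batch, grid_size, grid_size, grid_size) occupancy in {0, 1}
        return self.head(self.conv(voxels.unsqueeze(1)))

# Example: encode a batch of 32^3 voxel grids into 128-d features for the policy.
features = VoxelEncoder()(torch.zeros(4, 32, 32, 32))
print(features.shape)  # torch.Size([4, 128])
```

Inferring the flattened feature size from a dummy forward pass keeps the linear head consistent if the grid resolution is changed.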
We evaluated our approach on a real-world vacuum gripping task using a variety of objects, both seen and unseen during training. Policies trained with voxel-based encodings were compared against traditional image-based reinforcement learning methods.
Voxel-based policies demonstrated superior performance, achieving higher success rates and faster execution times than image-based approaches. The use of pre-trained VoxNet models and observation-space symmetries further enhanced policy robustness and generalization to unseen objects. Below are evaluation videos showcasing different gripping scenarios:
I continued this work with a multi-robot setup where a box is handed over between two UR5 arms using suction grippers. Each arm is equipped with a wrist-mounted D405 camera that provides voxelized, localized point cloud observations to the RL pipeline. Episodes begin after a scripted box pickup handled during the environment reset. Both robots are jointly controlled by a single RL policy operating in a 14-dimensional action space. Safety mechanisms automatically detect and handle potential collisions when the arms move too close together or when excessive forces are measured during the handover.
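The sketch below illustrates the kind of safety check described above, assuming access to both end-effector positions in a shared frame and wrist force/torque readings; the thresholds and the function name `is_unsafe` are hypothetical.

```python
import numpy as np

# Hypothetical thresholds; the actual values depend on the hardware setup.
MIN_EE_DISTANCE = 0.08   # meters between the two end effectors
MAX_FORCE = 40.0         # Newtons measured at either wrist

def is_unsafe(ee_pos_a, ee_pos_b, wrench_a, wrench_b):
    """Return True if the dual-arm episode should be aborted.

    ee_pos_a, ee_pos_b : (3,) end-effector positions in a shared world frame
    wrench_a, wrench_b : (6,) force/torque readings [Fx, Fy, Fz, Tx, Ty, Tz]
    """
    too_close = np.linalg.norm(np.asarray(ee_pos_a) - np.asarray(ee_pos_b)) < MIN_EE_DISTANCE
    too_forceful = max(np.linalg.norm(wrench_a[:3]), np.linalg.norm(wrench_b[:3])) > MAX_FORCE
    return too_close or too_forceful
```

A check like this would typically run at every control step and, when triggered, end the episode before continuing training.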
The policy was trained on the green box in the vertical handover configuration. The horizontal handover and the red box scenarios were never seen during training.
Due to time limitations, I have not yet conducted any experiments on the multi-robot handover setup.
@article{sutter2025comparison,
  title={A comparison of visual representations for real-world reinforcement learning in the context of vacuum gripping},
  author={Sutter, Nico and Hartmann, Valentin N and Coros, Stelian},
  journal={arXiv preprint arXiv:2503.02405},
  year={2025}
}