Robotic Grasping and Manipulation Competition
Track 1: Human-to-Robot Handover

Team: Shakey's Legacy

Abhimanyu Bhowmik¹, Petr Vanc¹, Lukáš Rustler², Pradyun Sharma³, Jan Behrens¹

¹CIIRC, Czech Technical University in Prague ²FEE, Czech Technical University in Prague ³TU Delft

ICRA 2026 RGMC Track 1

Overview of our complete human-to-robot handover pipeline — from perception and grasp planning to compliant robot control and safe cup delivery.

Overview

Track 1 of the ICRA 2026 Robotic Grasping and Manipulation Competition (RGMC) challenges teams to build a complete robotic system that receives a container (such as a cup or bottle) directly from a human hand and delivers it to a specified target location without spilling its contents.

Our team, Shakey's Legacy, tackled this challenge by developing a fully integrated perception-to-action pipeline: a multimodal vision system identifies and tracks the container in real time, estimates its geometry and fill level, and predicts the human's intent; a behaviour-tree orchestrator coordinates the handover phases from observation through compliant grasping to controlled placement. Tactile sensing at the end-effector ensures a firm yet gentle grip throughout the transfer.

The system was developed and evaluated on a Franka Emika Panda robotic arm instrumented with two Intel RealSense D455 RGB-D cameras and DIGIT fingertip tactile sensors, following the official CORSMAL benchmark protocol.

Trial Runs

Trials with different cups and participants.

System Architecture

End-to-End Pipeline. From dual RGB-D camera input, the system performs object and hand segmentation, 3D pose estimation, cup geometry inference, and behaviour-tree orchestration of the handover phases.

Perception — Object & Hand Tracking.

Two extrinsically calibrated Intel RealSense D455 cameras independently segment the container and human hand, estimating 3D positions and velocities. Masks from Grounded-DINO and SAM2 are merged at the tracker level for robust, occlusion-resilient state estimation.
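The per-camera 3D estimate described above can be sketched as follows. This is a minimal illustration rather than the team's code: it assumes a pinhole camera model, one mask-centroid pixel with a depth reading per camera, and a known 4x4 camera-to-robot-base transform from the extrinsic calibration. The function names (`deproject`, `to_base`, `fuse`, `velocity`) and the plain averaging used to merge the two views are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def deproject(u: float, v: float, depth_m: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_base(T_base_cam: Sequence[Sequence[float]], p: Vec3) -> Vec3:
    """Apply a 4x4 homogeneous transform (row-major nested lists) to a 3D point."""
    return tuple(
        T_base_cam[i][0] * p[0] + T_base_cam[i][1] * p[1]
        + T_base_cam[i][2] * p[2] + T_base_cam[i][3]
        for i in range(3)
    )


def fuse(points: List[Vec3]) -> Vec3:
    """Merge per-camera estimates by simple averaging (an assumed fusion rule)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def velocity(prev: Vec3, curr: Vec3, dt: float) -> Vec3:
    """Finite-difference velocity between two consecutive fused positions."""
    return tuple((c - p) / dt for p, c in zip(prev, curr))
```

If one camera loses the object to occlusion, `fuse` can be called with the single remaining estimate, which is one simple way a two-camera setup stays usable under occlusion.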

Geometry Estimation — Cup Shape & Fill Level.

A parametric cylinder model is fitted to the segmented point cloud to estimate the cup's top width, bottom width, height, and volume. VLM-based geometric reasoning infers the fill level and approximate content mass.
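As a worked example of how the estimated dimensions can yield a volume and a content mass, the sketch below treats the cup as a truncated cone (frustum). The frustum assumption, the function names, and the default water density of 1000 kg/m³ are illustrative choices, not details confirmed by the pipeline description.

```python
import math


def frustum_volume(top_width: float, bottom_width: float, height: float) -> float:
    """Volume of a truncated cone given its two diameters and height (SI units)."""
    r_top = top_width / 2.0
    r_bot = bottom_width / 2.0
    return math.pi * height / 3.0 * (r_top ** 2 + r_top * r_bot + r_bot ** 2)


def content_volume(top_width: float, bottom_width: float, height: float,
                   fill_fraction: float) -> float:
    """Liquid volume when the cup is filled to `fill_fraction` of its height."""
    if not 0.0 <= fill_fraction <= 1.0:
        raise ValueError("fill_fraction must lie in [0, 1]")
    # The wall is straight, so the width at the liquid surface interpolates linearly.
    surface_width = bottom_width + (top_width - bottom_width) * fill_fraction
    return frustum_volume(surface_width, bottom_width, height * fill_fraction)


def content_mass(top_width: float, bottom_width: float, height: float,
                 fill_fraction: float, density_kg_m3: float = 1000.0) -> float:
    """Approximate content mass in kg; the default density assumes water."""
    return density_kg_m3 * content_volume(top_width, bottom_width, height, fill_fraction)
```

For a straight-walled cup that is 8 cm wide and 10 cm tall, this gives a capacity of about 0.5 litres, and a half-full cup of water would weigh roughly 0.25 kg.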

Behaviour Tree — Phase Orchestration.

A multi-phase behaviour tree (Initialization → Cup Tracking → Handover & Grasping → Robot Delivery) transitions based on touch sensing and proximity thresholds, ensuring the robot acts only when the human is ready to release.
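The phase transitions can be illustrated with a small state machine. This is a deliberately simplified stand-in for a full behaviour tree: the `Observation` fields, the 5 cm proximity threshold, and the exact condition attached to each transition are assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Phase(Enum):
    INITIALIZATION = auto()
    CUP_TRACKING = auto()
    HANDOVER_GRASPING = auto()
    ROBOT_DELIVERY = auto()


@dataclass
class Observation:
    cup_detected: bool        # perception currently reports a container
    gripper_to_cup_m: float   # distance between gripper and cup, in metres
    touch_detected: bool      # tactile sensors report contact


PROXIMITY_THRESHOLD_M = 0.05  # hypothetical value, not taken from the system


def next_phase(phase: Phase, obs: Observation) -> Phase:
    """Advance one phase when its guard condition holds, otherwise stay put."""
    if phase is Phase.INITIALIZATION and obs.cup_detected:
        return Phase.CUP_TRACKING
    if (phase is Phase.CUP_TRACKING and obs.cup_detected
            and obs.gripper_to_cup_m <= PROXIMITY_THRESHOLD_M):
        return Phase.HANDOVER_GRASPING
    if phase is Phase.HANDOVER_GRASPING and obs.touch_detected:
        return Phase.ROBOT_DELIVERY
    return phase
```

Because each guard is checked only in its own phase, a spurious touch reading during tracking cannot skip the robot straight to delivery.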

Grasping & Tactile Control.

A grasp classifier selects a top, middle, or bottom grasp configuration based on the detected hand-object layout. DIGIT fingertip tactile sensors confirm contact. Velocity control keeps the interaction fast, while safety boundaries keep it safe.
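The idea of choosing a grasp region from the hand-object layout, and of bounding the commanded speed, can be sketched with two small helpers. This is a rule-based stand-in, not the grasp classifier itself: it assumes the hand is described by the intervals it occupies along the cup's normalised height (0 at the base, 1 at the rim), and `GRASP_CENTERS`, `select_grasp`, and `clamp_speed` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Normalised heights of the three candidate grasp regions (assumed values).
GRASP_CENTERS: Dict[str, float] = {"top": 5 / 6, "middle": 1 / 2, "bottom": 1 / 6}


def clearance(center: float, occupied: List[Tuple[float, float]]) -> float:
    """Distance from a grasp centre to the nearest hand-occupied interval."""
    if not occupied:
        return 1.0
    return min(max(lo - center, center - hi, 0.0) for lo, hi in occupied)


def select_grasp(occupied: List[Tuple[float, float]]) -> str:
    """Pick the grasp region that lies farthest from the human hand."""
    return max(GRASP_CENTERS, key=lambda name: clearance(GRASP_CENTERS[name], occupied))


def clamp_speed(v: Sequence[float], v_max: float) -> Tuple[float, ...]:
    """Scale a velocity command so that its norm never exceeds v_max."""
    speed = math.sqrt(sum(c * c for c in v))
    if speed <= v_max or speed == 0.0:
        return tuple(v)
    scale = v_max / speed
    return tuple(c * scale for c in v)
```

With a hand around the lower third of the cup, `select_grasp([(0.0, 0.3)])` picks the top region; with hands at both the base and the rim, the middle region has the most clearance.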

Perception Interface. Our real-time tracker GUI shows session details, live metrics, and system status, and lets us update the configuration parameters for each trial.

Demonstrations

Key components of our system demonstrated on hardware.

Full Handover with Segmentation & RViz

Complete end-to-end handover: perception → behaviour tree → grasp → delivery

The robot detects the container in the human hand, transitions through the four handover phases, grasps the cup with tactile feedback, and delivers it to the target without spilling. The split view shows the live camera feed with segmentation overlays alongside the RViz robot model visualisation.

Tactile Sensing with DIGIT

Fingertip contact sensing for grasp confirmation and slip detection

The DIGIT vision-based tactile sensor embedded in the robot's fingertips captures high-resolution contact imagery in real time. This enables the system to detect initial finger contact with the cup surface, ensuring a secure grasp without excessive gripping force.
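One simple way to turn tactile imagery into a contact signal is to compare each frame against a no-contact baseline frame. The sketch below assumes grayscale images stored as nested lists and a hand-picked threshold; the function names and the threshold value are illustrative and say nothing about how the actual system processes DIGIT output.

```python
from typing import Sequence

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel intensities


def mean_abs_diff(baseline: Image, frame: Image) -> float:
    """Mean absolute per-pixel difference between a baseline and a new frame."""
    total = 0.0
    count = 0
    for row_b, row_f in zip(baseline, frame):
        for b, f in zip(row_b, row_f):
            total += abs(f - b)
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


def contact_detected(baseline: Image, frame: Image, threshold: float = 5.0) -> bool:
    """Report contact when the frame deviates enough from the no-contact baseline."""
    return mean_abs_diff(baseline, frame) > threshold
```

The baseline would typically be captured with the gripper open, so that gel deformation on contact shows up as a change in intensity.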

Cup Geometry Estimation

Parametric shape fitting for container dimensions and fill level

Using the segmented depth point cloud from the RealSense cameras, we fit a parametric cylinder model to estimate the container's center. We estimate the top width, bottom width, and height using parametric circle fitting and RANSAC-based outlier rejection. These dimensions inform the robot's grasp configuration selection and are used to estimate the fill level and approximate content mass prior to grasping.
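Circle fitting with RANSAC outlier rejection can be illustrated on 2D points, for example rim points projected onto a horizontal plane, where the fitted diameter would correspond to the top width. The following is a generic textbook sketch under that assumption: the function names, iteration count, and inlier tolerance are hypothetical, and no refinement step is applied to the winning model.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Circle = Tuple[float, float, float]  # (center_x, center_y, radius)


def circle_from_points(a: Point, b: Point, c: Point) -> Optional[Circle]:
    """Circle through three points, or None if they are (nearly) collinear."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = (a[0] ** 2 + a[1] ** 2, b[0] ** 2 + b[1] ** 2, c[0] ** 2 + c[1] ** 2)
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy, math.hypot(a[0] - ux, a[1] - uy))


def ransac_circle(points: List[Point], iterations: int = 200,
                  tol: float = 1e-3, seed: int = 0) -> Tuple[Optional[Circle], int]:
    """Return the sampled circle with the most inliers and its inlier count."""
    rng = random.Random(seed)
    best: Optional[Circle] = None
    best_inliers = 0
    for _ in range(iterations):
        model = circle_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        cx, cy, r = model
        # A point is an inlier when its radial distance error is within tol.
        inliers = sum(1 for x, y in points if abs(math.hypot(x - cx, y - cy) - r) <= tol)
        if inliers > best_inliers:
            best, best_inliers = model, inliers
    return best, best_inliers
```

Running the same fit on point slices near the rim and near the base would give the top and bottom widths as twice the respective radii.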

Grounded-DINO + SAM2 Segmentation

Open-vocabulary object detection and instance segmentation

We use Grounded-DINO for language-guided open-vocabulary detection of containers and human hands, combined with SAM2 for precise pixel-level instance segmentation of these detected bounding boxes. This allows us to robustly segment the container and human hand across varying cup shapes, materials, and lighting conditions, without retraining for new object categories.

Benchmark Results

S8 Offline Score Evaluation

S8 benchmark run under the CORSMAL evaluation protocol

Complete recording of our S8 benchmark evaluation trial for End‑Effector Reachability Accuracy estimation. This task evaluates how accurately a robot can reach six canonical target poses in its workspace.
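To make the notion of reaching accuracy concrete, the sketch below computes a position error and an orientation error between a target pose and the pose actually reached. It is only one plausible way to measure such accuracy and may differ from the official benchmark metric; it assumes positions in metres and orientations given as unit quaternions.

```python
import math
from typing import Sequence


def position_error(target: Sequence[float], reached: Sequence[float]) -> float:
    """Euclidean distance between target and reached positions, in metres."""
    return math.dist(target, reached)


def orientation_error(q_target: Sequence[float], q_reached: Sequence[float]) -> float:
    """Smallest rotation angle (radians) between two unit quaternions."""
    # The absolute value treats q and -q as the same rotation.
    dot = abs(sum(a * b for a, b in zip(q_target, q_reached)))
    return 2.0 * math.acos(min(1.0, dot))
```

Averaging both errors over the six target poses would give a single pair of summary numbers for a run.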

Official Competition Score. Results from the RGMC 2026 Track 1 evaluation under the CORSMAL benchmark protocol.

The Team

Team Shakey's Legacy is a collaboration between the Machine Learning Group (Prof. Robert Babuška) and the Humanoid and Cognitive Robotics Group (Dr. Matěj Hoffmann) at Czech Technical University in Prague.

Acknowledgements

This work was supported by ROBOPROX (Robotics and Advanced Industrial Production) and conducted at the Czech Institute of Informatics, Robotics and Cybernetics (CIIRC) and the Faculty of Electrical Engineering (FEE), Czech Technical University in Prague.
