GHOST learns skills from robot teleoperation data and optionally uses human demonstrations to generalize to novel tasks. We train a hierarchical policy that decouples task-specific goal prediction (πhi) from embodiment-specific action execution (πlo). By training πhi on both robot and human data and πlo purely on robot data, GHOST transfers learned manipulation skills to out-of-distribution tasks using third-person human video demonstrations.
We further show that this hierarchical interface makes it easy to incorporate human demonstrations without relying on noisy action retargeting. Because sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations from a small number of human demonstrations.
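The split described above can be sketched as a two-level control loop. This is a minimal illustration, not the actual GHOST implementation: the stubs `pi_hi` and `pi_lo`, their signatures, and the P-controller stand-in are all assumptions made for the example.

```python
import numpy as np

def pi_hi(observation):
    # High-level policy: predicts a 3D end-effector keypoint goal.
    # In GHOST this is trained on both robot and human data; here it is
    # stubbed as "move the end-effector up by 10 cm".
    return observation["ee_keypoints"] + np.array([0.0, 0.0, 0.1])

def pi_lo(observation, goal):
    # Low-level policy: outputs an action toward the predicted goal.
    # In GHOST this is trained purely on robot data; here it is stubbed
    # as a simple proportional controller.
    return 0.5 * (goal - observation["ee_keypoints"])

def rollout_step(observation):
    goal = pi_hi(observation)           # embodiment-agnostic sub-goal
    action = pi_lo(observation, goal)   # embodiment-specific execution
    return goal, action

obs = {"ee_keypoints": np.zeros(3)}
goal, action = rollout_step(obs)
```

The key property the sketch exercises is the interface: only `goal` crosses between the two levels, so the high level never needs to know how the robot's actions are parameterized.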
Human demonstration collection. We collect human video demonstrations for out-of-distribution generalization, annotating sub-goal timesteps where key manipulation events occur (e.g., grasping, placing, folding transitions).
End-effector representations for robot and human demonstrations. Instead of explicitly retargeting the gripper pose to a translation and SO(3) rotation, we represent gripper poses as sets of 3D keypoints (red spheres): gripper base, fingertip locations, and grasping center. Left: Parallel-jaw gripper. Right: MANO hand. (Interactive - drag to rotate, scroll to zoom)
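The keypoint representation in the figure can be illustrated with a small conversion function. The offsets and finger length below are illustrative placeholders, not the actual ALOHA gripper geometry, and `gripper_keypoints` is a hypothetical helper, not GHOST's code.

```python
import numpy as np

def gripper_keypoints(position, rotation, width):
    """Represent a parallel-jaw gripper pose as 3D keypoints instead of
    a translation + SO(3) rotation.

    position: (3,) world translation; rotation: (3, 3) rotation matrix;
    width: jaw opening in meters. Returns (4, 3) world-frame keypoints.
    """
    finger_len = 0.08  # assumed fingertip offset along the approach axis
    local = np.array([
        [0.0,        0.0, 0.0],         # gripper base
        [+width / 2, 0.0, finger_len],  # left fingertip
        [-width / 2, 0.0, finger_len],  # right fingertip
        [0.0,        0.0, finger_len],  # grasping center
    ])
    # Rotate local offsets into the world frame, then translate.
    return position + local @ rotation.T

kps = gripper_keypoints(np.array([0.3, 0.0, 0.2]), np.eye(3), 0.04)
```

One property worth noting: the grasping center is by construction the midpoint of the two fingertips, so the same four-keypoint layout can describe both the parallel-jaw gripper and a MANO hand without an explicit rotation parameterization.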
GHOST High-level goal prediction architecture: RGB-D observations from multiple cameras are processed with a DINOv3 encoder, with patch tokens augmented by 3D coordinates. Additional context (gripper state, language embedding, embodiment name) is encoded via separate MLPs. An encoder-only transformer processes all tokens, with each patch predicting GMM parameters of a 3D goal distribution over end-effector keypoints.
High-level policy GMM predictions at inference. We visualize the mixture components of the GMM, with the opacity encoding mixing weight wi and arrows showing projections of 3D residuals δi from patch centers pi.
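Decoding the per-patch mixture in the figure amounts to: each patch i contributes a component with mean pi + δi and mixing weight wi. The sketch below, with illustrative array names and a softmax over hypothetical per-patch logits, shows how a single goal point could be read out of such a prediction (the most likely component's mean, or the mixture expectation).

```python
import numpy as np

def decode_goal(patch_centers, residuals, logits):
    """Read a 3D goal out of a per-patch GMM prediction.

    patch_centers: (N, 3) 3D patch locations p_i
    residuals:     (N, 3) predicted offsets delta_i
    logits:        (N,)   unnormalized mixing scores
    """
    w = np.exp(logits - logits.max())
    w /= w.sum()                        # mixing weights w_i (softmax)
    means = patch_centers + residuals   # component means p_i + delta_i
    mode = means[np.argmax(w)]          # mean of the heaviest component
    expectation = (w[:, None] * means).sum(axis=0)
    return w, mode, expectation

p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
d = np.array([[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
w, mode, mean = decode_goal(p, d, np.array([2.0, 0.0]))
```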
End-effector heatmap generation for goal conditioning. Predicted 3D keypoints are projected to 2D coordinates, and then converted to dense distance field heatmaps that encode spatial proximity to each keypoint.
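The two steps in this caption (pinhole projection, then a dense proximity map per keypoint) can be sketched as follows. The camera intrinsics, image size, and exponential falloff scale are all illustrative assumptions.

```python
import numpy as np

def project(points_3d, K):
    """Project (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    uv = points_3d @ K.T
    return uv[:, :2] / uv[:, 2:3]  # perspective divide

def distance_heatmaps(uv, height, width, sigma=20.0):
    """Build one dense (H, W) heatmap per projected keypoint, encoding
    spatial proximity via an exponential falloff of pixel distance."""
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2)
    dists = np.linalg.norm(grid[None] - uv[:, None, None, :], axis=-1)
    return np.exp(-dists / sigma)                           # (N, H, W)

# Illustrative intrinsics: focal length 100 px, principal point (32, 32).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
uv = project(np.array([[0.0, 0.0, 1.0]]), K)  # point on the optical axis
maps = distance_heatmaps(uv, 64, 64)
```

A point on the optical axis projects to the principal point, so its heatmap peaks there; the falloff then gives the low-level policy a smooth spatial signal rather than a single pixel.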
GHOST Low-level goal-conditioned policy architecture. Images from each camera and the projected end-effector heatmap images are processed independently with ResNet encoders and concatenated along with the proprioceptive input into a global conditioning vector for the Diffusion Policy.
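The conditioning pathway in this caption can be sketched shape-wise. The random projection below is only a stand-in for a ResNet encoder, and the feature dimensions and camera count are illustrative, not GHOST's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    # Stand-in for a ResNet encoder: flatten and apply a fixed random
    # projection to a 64-dim feature vector (shapes only, no learning).
    flat = image.reshape(-1)
    proj = rng.standard_normal((64, flat.size)) / np.sqrt(flat.size)
    return proj @ flat

# Three cameras, each with an RGB image and a projected heatmap image.
cameras = [rng.standard_normal((8, 8, 3)) for _ in range(3)]
heatmaps = [rng.standard_normal((8, 8, 1)) for _ in range(3)]
proprio = rng.standard_normal(14)  # e.g. joint positions + gripper state

# Each image stream is encoded independently, then everything is
# concatenated into one global conditioning vector for the denoiser.
features = [encode(x) for x in cameras] + [encode(h) for h in heatmaps]
cond = np.concatenate(features + [proprio])
```

With 6 image streams at 64 features each plus 14 proprioceptive values, the conditioning vector here has 398 entries; the real dimensions depend on the encoder and robot.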
Interactive 4D visualization of a GHOST rollout. 3 synchronized camera views and a 3D point cloud reconstruction, powered by Rerun. Drag to rotate the 3D view, scroll to zoom, and use the timeline to scrub through the episode.
Task Overview
| Task | Data Source | # Demos | Generalization Type |
|---|---|---|---|
| Pick-and-Place | | | |
| plate-on-table | Robot | 20 | — |
| plate-in-bin | Robot | 20 | — |
| mug-in-bin | Robot | 20 | — |
| mug-on-table | Human | 20 | Object combination |
| Cloth Folding | | | |
| fold-onesie | Robot | 33 | — |
| fold-shirt | Robot | 50 | — |
| fold-onesie-ood | Human | 17 | Object instance |
| fold-towel | Human | 50 | Object category + Skill composition |
| Hammer Pin | | | |
| hammer-pin | Human+Robot | 100 | — |
Pick and Place: GHOST generalizes pick-and-place skills to novel object combinations (mug-on-table) after training on in-distribution tasks (plate-on-table, plate-in-bin, mug-in-bin). We overlay a colormap on the image to visualize the predicted goal.
Hammer Pin: Pick up a hammer and strike the target pin. The task requires a precise grasp of the hammer tool and an accurate strike on the correct pin. We overlay a colormap on the image to visualize the predicted goal.
Cloth Folding: After training on folding onesies and shirts (in-distribution), GHOST shows meaningful progress in generalizing to novel object instances (fold-onesie-ood) and novel object categories with skill composition (fold-towel). We overlay a colormap on the image to visualize the predicted goal.
Pick and Place Results (success rate, %)
| Method | plate-on-table | mug-on-table |
|---|---|---|
| DP | 90 | 10 |
| MimicPlay | 50 | 30 |
| GHOST (Ours - Robot Only) | 90 | 60 |
| GHOST (Ours) | 100 | 70 |
Onesie Folding Results (success rate, %)
| Task | Method | 1-step | 2-step | 3-step | 4-step | Final |
|---|---|---|---|---|---|---|
| fold-onesie | DP | 80 | 60 | 50 | 40 | 20 |
| | MimicPlay | 90 | 80 | 80 | 80 | 50 |
| | GHOST (Ours - Robot Only) | 100 | 100 | 100 | 90 | 90 |
| | GHOST (Ours) | 100 | 100 | 100 | 90 | 90 |
| fold-onesie-ood (Novel Object Instance) | DP | 40 | 40 | 10 | 0 | 0 |
| | MimicPlay | 70 | 40 | 40 | 40 | 0 |
| | GHOST (Ours - Robot Only) | 100 | 100 | 60 | 50 | 40 |
| | GHOST (Ours) | 100 | 100 | 90 | 70 | 50 |
Towel Folding Results (success rate, %)
| Method | fold-towel (Novel Category + Skill Composition) |
|---|---|
| MimicPlay | 20 |
| GHOST (Ours) | 40 |
Hammer Pin Results (success rate, %)
| Method | hammer-pin |
|---|---|
| DP | 20 |
| GHOST (Ours - Robot Only) | 60 |
| GHOST (Ours) | 70 |
Key Findings
Do hierarchical policies improve in-distribution performance even without human data?
Yes. On plate-on-table, nearly all methods saturate, as the task is simple and training data is sufficient. On long-horizon, complex tasks, however, we see large gains: final success on fold-onesie improves from 20% (DP) to 90% (GHOST - Robot Only), showing the benefit of hierarchical decomposition. Similarly, on hammer-pin, which requires precisely grasping the hammer tool and striking the correct pin, success improves from 20% (DP) to 60% (GHOST - Robot Only).
Do human demonstrations enable transferring learned skills to novel object instances, categories, and contexts?
Yes. Human demonstrations unlock meaningful OOD transfer of learned skills to novel objects and skill compositions. GHOST achieves 70% success on mug-on-table, a task featuring an object combination unseen in robot demonstrations. On fold-onesie-ood, GHOST achieves 50% final success versus 40% for GHOST - Robot Only and 0% for both MimicPlay and flat DP. On the hardest setting, generalizing to a novel object category with skill composition (fold-towel), GHOST achieves a 40% success rate, compared to 20% for MimicPlay.
Coming soon