👻 GHOST
Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation
GHOST Teaser Figure

GHOST learns skills from robot teleoperation data and optionally uses human demonstrations to generalize to novel tasks. We train a hierarchical policy that decouples task-specific goal prediction (πhi) from embodiment-specific action execution (πlo). By training πhi on both robot and human data and πlo purely on robot data, GHOST transfers learned manipulation skills to out-of-distribution tasks specified by third-person human video demonstrations.

Abstract.
We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy.

Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
Approach.
The GHOST framework learns a hierarchy over sub-goal end-effector poses. During training, sub-goal boundaries provide supervision for the high-level planner πhi, while the full robot trajectories provide action supervision for the low-level controller πlo. At test time, πhi predicts the next end-effector sub-goal (as a distribution) from workspace observations and language, and πlo executes goal-conditioned control toward that sub-goal. This separation isolates long-horizon reasoning in πhi and maintains precise, embodiment-specific control in πlo.
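The test-time loop described above can be sketched as follows. This is a minimal illustration, not GHOST's actual API: the `env`, `pi_hi`, and `pi_lo` interfaces, the replanning budget, and the distance-based stopping criterion are all hypothetical stand-ins.

```python
import numpy as np

def rollout(pi_hi, pi_lo, env, max_subgoals=10, steps_per_goal=100, tol=0.01):
    """Sketch of the hierarchical test-time loop: pi_hi proposes the next
    3D end-effector sub-goal, and pi_lo runs goal-conditioned control until
    the end-effector reaches it, then pi_hi replans.
    All interfaces here (env.observe, env.step, ...) are hypothetical."""
    obs = env.observe()
    for _ in range(max_subgoals):
        # High-level policy: predict the next end-effector sub-goal
        subgoal = pi_hi(obs)                      # e.g. shape (num_keypoints, 3)
        for _ in range(steps_per_goal):
            action = pi_lo(obs, subgoal)          # goal-conditioned action
            obs = env.step(action)
            ee = obs["ee_keypoints"]              # current end-effector keypoints
            if np.linalg.norm(ee - subgoal) < tol:
                break                             # sub-goal reached; replan
    return obs
```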
1. Data Collection and Processing.
We collect two types of demonstrations: robot teleoperation data and human demonstrations. For robot data, we collect demonstrations across multiple task variants, each demonstrating the same skill (e.g., pick-and-place), to provide in-distribution coverage. Sub-goals are extracted automatically by identifying timesteps where the gripper state changes. For human data, we collect demonstrations of the same skill in novel settings using new objects or novel applications. We track 3D hand poses using off-the-shelf hand pose estimators and manually annotate sub-goal timesteps.
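For the robot data, the automatic sub-goal extraction can be sketched as below: mark the timesteps at which the gripper crosses an open/closed boundary. The 0.5 threshold on a continuous gripper signal is an illustrative assumption.

```python
import numpy as np

def extract_subgoal_timesteps(gripper_states, threshold=0.5):
    """Sketch of automatic sub-goal extraction from teleoperation data:
    return timesteps where the (continuous) gripper state crosses an
    open/closed threshold, i.e. where a grasp or release occurs.
    The 0.5 threshold is an illustrative assumption."""
    binary = np.asarray(gripper_states) > threshold   # open (True) vs closed
    changes = np.flatnonzero(binary[1:] != binary[:-1]) + 1
    return changes.tolist()
```

For human data, these boundaries are instead annotated by hand, as noted above.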

Human demonstration collection. We collect human video demonstrations for out-of-distribution generalization, annotating sub-goal timesteps where key manipulation events occur (e.g., grasping, placing, folding transitions).

To enable cross-embodiment transfer, we represent end-effector poses as a set of four 3D keypoints that can be extracted from both robot grippers and human hands: the gripper base (or palm), the two fingertips, and the grasping center. This unified representation allows the high-level policy to be trained on both robot and human demonstrations.
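A minimal sketch of this four-keypoint layout for a parallel-jaw gripper is below; taking the grasping center as the fingertip midpoint is an illustrative assumption, and the same layout could be filled from estimated MANO hand joints for human data.

```python
import numpy as np

def gripper_keypoints(base, left_tip, right_tip):
    """Sketch of the unified 4-keypoint end-effector representation:
    gripper base (or palm), the two fingertips, and the grasping center
    (taken here as the fingertip midpoint, an illustrative assumption)."""
    base, left_tip, right_tip = map(np.asarray, (base, left_tip, right_tip))
    grasp_center = 0.5 * (left_tip + right_tip)
    return np.stack([base, left_tip, right_tip, grasp_center])  # (4, 3)
```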

ALOHA Gripper

MANO Hand

End-effector representations for robot and human demonstrations. Instead of explicitly retargeting the gripper pose to a translation and SO(3) rotation, we represent gripper poses as sets of 3D keypoints (red spheres): gripper base, fingertip locations, and grasping center. Left: Parallel-jaw gripper. Right: MANO hand.

2. High-Level Goal Prediction Policy.
High-Level Goal Prediction Architecture

GHOST High-level goal prediction architecture: RGB-D observations from multiple cameras are processed with a DINOv3 encoder, with patch tokens augmented by 3D coordinates. Additional context (gripper state, language embedding, embodiment name) is encoded via separate MLPs. An encoder-only transformer processes all tokens, with each patch predicting GMM parameters of a 3D goal distribution over end-effector keypoints.

The high-level policy πhi predicts the end-effector pose at the next sub-goal timestep. We parameterize it as an encoder-only transformer operating on RGB-D observations from multiple cameras. Each RGB-D image is processed independently with a frozen DINOv3 encoder, producing patch tokens that are augmented with the 3D coordinates of their patch centers. Additional tokens encode the current gripper pose, the language instruction embedded with Flan-T5, and a learnable embodiment token specifying whether the demonstration comes from a human or a robot. To handle multimodality in the demonstrations, we model the goal distribution as a dense per-patch Gaussian Mixture Model (GMM), where each patch predicts mixing weights and 3D residual vectors relative to its patch center.

High-level policy GMM predictions at inference. We visualize the mixture components of the GMM, with opacity encoding the mixing weight w_i and arrows showing projections of the 3D residuals δ_i from the patch centers p_i.
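Sampling a goal from this per-patch mixture can be sketched as follows, for a single 3D point: each patch i contributes a component centered at p_i + δ_i with weight w_i. The isotropic noise scale is an illustrative assumption, and in practice one mixture is predicted per end-effector keypoint.

```python
import numpy as np

def sample_goal(patch_centers, weights, residuals, sigma=0.01, rng=None):
    """Sketch of sampling from the dense per-patch GMM over 3D goals:
    patch i contributes a component at patch_centers[i] + residuals[i]
    with mixing weight weights[i]. An isotropic sigma is assumed here
    for illustration; the actual parameterization may differ."""
    rng = np.random.default_rng(rng)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                 # normalize mixing weights
    means = np.asarray(patch_centers) + np.asarray(residuals)
    i = rng.choice(len(weights), p=weights)           # pick a mixture component
    return means[i] + sigma * rng.standard_normal(3)  # sample around its mean
```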

3. Low-Level Goal-Conditioned Policy.
We instantiate πlo as a Diffusion Policy that generates action chunks by denoising, conditioned on observations and spatial goals. Each keypoint of the 3D goal is projected onto the image plane of each camera using the camera parameters, yielding sparse 2D coordinates. These are converted to dense end-effector heatmaps in which each channel encodes the pixel distance field from one keypoint. We choose heatmaps as the goal representation because they provide a smooth, dense spatial signal that aids optimization stability.
End-Effector Heatmap Generation

End-effector heatmap generation for goal conditioning. Predicted 3D keypoints are projected to 2D coordinates, and then converted to dense distance field heatmaps that encode spatial proximity to each keypoint.
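The projection-and-heatmap step can be sketched as below, assuming keypoints are already expressed in the camera frame and using a pinhole intrinsics matrix; normalizing the distance field by the image diagonal is an illustrative choice.

```python
import numpy as np

def keypoint_heatmaps(keypoints_3d, K, height, width):
    """Sketch of goal conditioning via end-effector heatmaps: project each
    3D keypoint (camera frame) to pixels with intrinsics K, then fill one
    channel per keypoint with the pixel distance field to that keypoint.
    Normalization by the image diagonal is an illustrative choice."""
    kps = np.asarray(keypoints_3d, dtype=float)       # (N, 3), camera frame
    uv = (K @ kps.T).T                                # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]                       # divide by depth
    ys, xs = np.mgrid[0:height, 0:width]
    diag = np.hypot(height, width)
    maps = [np.hypot(xs - u, ys - v) / diag for u, v in uv]
    return np.stack(maps)                             # (N, H, W)
```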

Low-Level Goal-Conditioned Policy Architecture

GHOST Low-level goal-conditioned policy architecture. Images from each camera and the projected end-effector heatmap images are processed independently with ResNet encoders and concatenated along with the proprioceptive input into a global conditioning vector for the Diffusion Policy.
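Assuming the per-camera RGB and heatmap features have already been extracted by their encoders, assembling the global conditioning vector amounts to flattening and concatenation; the function below is an illustrative sketch, not the actual implementation.

```python
import numpy as np

def global_conditioning(rgb_feats, heatmap_feats, proprio):
    """Sketch of assembling the Diffusion Policy's global conditioning
    vector: per-camera image features and heatmap features (e.g. from
    ResNet encoders) are flattened and concatenated with proprioception.
    Feature extraction itself is omitted; inputs are assumed precomputed."""
    parts = [np.ravel(f) for f in rgb_feats]          # one feature map per camera
    parts += [np.ravel(f) for f in heatmap_feats]     # one per heatmap image
    parts.append(np.ravel(proprio))                   # proprioceptive state
    return np.concatenate(parts)
```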

Interactive 4D Visualization.
The project page includes an interactive 4D visualization of a GHOST rollout on the onesie folding task, powered by Rerun: synchronized feeds from three workspace cameras alongside a 3D point cloud reconstruction of the scene, all scrubbable over time. (Video downsampled for efficient streaming.)

Experiments and Results.
We evaluate GHOST across multiple challenging real-world manipulation tasks to demonstrate both in-distribution performance improvements and out-of-distribution generalization through human demonstrations.

Task Overview

Task                Data Source   # Demos   Generalization Type
Pick-and-Place
  plate-on-table    Robot          20       —
  plate-in-bin      Robot          20       —
  mug-in-bin        Robot          20       —
  mug-on-table      Human          20       Object combination
Cloth Folding
  fold-onesie       Robot          33       —
  fold-shirt        Robot          50       —
  fold-onesie-ood   Human          17       Object instance
  fold-towel        Human          50       Object category + Skill composition
Hammer Pin
  hammer-pin        Human+Robot   100       —

Pick and Place Results (success rate, %)

Method                      plate-on-table   mug-on-table
DP                                90               10
MimicPlay                         50               30
GHOST (Ours - Robot Only)         90               60
GHOST (Ours)                     100               70

Onesie Folding Results (success rate, %, per completed folding step)

fold-onesie
Method                      1-step   2-step   3-step   4-step   Final
DP                             80       60       50       40       20
MimicPlay                      90       80       80       80       50
GHOST (Ours - Robot Only)     100      100      100       90       90
GHOST (Ours)                  100      100      100       90       90

fold-onesie-ood (Novel Object Instance)
Method                      1-step   2-step   3-step   4-step   Final
DP                             40       40       10        0        0
MimicPlay                      70       40       40       40        0
GHOST (Ours - Robot Only)     100      100       60       50       40
GHOST (Ours)                  100      100       90       70       50

Towel Folding Results (success rate, %)

Method          fold-towel (Novel Category + Skill Composition)
MimicPlay       20
GHOST (Ours)    40

Hammer Pin Results (success rate, %)

Method                      hammer-pin
DP                              20
GHOST (Ours - Robot Only)       60
GHOST (Ours)                    70

Key Findings

Do hierarchical policies improve in-distribution performance even without human data?
Yes. On plate-on-table, nearly all methods saturate, since the task is simple and training data is plentiful. On long-horizon, complex tasks, however, we see significant gains: on fold-onesie, final success increases from 20% (DP) to 90% (GHOST - Robot Only), showing large benefits from hierarchical decomposition. Similarly, on hammer-pin, which requires precisely grasping the hammer and striking the correct pin, performance improves from 20% (DP) to 60% (GHOST - Robot Only).

Do human demonstrations enable transferring learned skills to novel object instances, categories, and contexts?
Yes. Human demonstrations unlock meaningful out-of-distribution transfer of learned skills to novel objects and skill compositions. GHOST achieves 70% success on mug-on-table, a task featuring an object combination unseen in the robot demonstrations. On fold-onesie-ood, GHOST reaches 50% final success, versus 40% for GHOST - Robot Only and 0% for both MimicPlay and flat DP. On the hardest task, generalizing to a novel object category and skill composition (fold-towel), GHOST achieves a 40% success rate compared to 20% for MimicPlay.

BibTeX

Coming soon