👻 GHOST
Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation
GHOST Teaser Figure

GHOST learns skills from robot teleoperation data and optionally uses human demonstrations to generalize to novel tasks. We train a hierarchical policy that decouples task-specific goal prediction (πhi) from embodiment-specific action execution (πlo). By training πhi on both robot and human data and πlo purely on robot data, GHOST transfers learned manipulation skills to out-of-distribution tasks specified by third-person human video demonstrations.

Abstract.
We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy.

Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
Approach.
The GHOST framework learns a hierarchy over sub-goal end-effector poses. During training, sub-goal boundaries provide supervision for the high-level planner πhi, while the full robot trajectories provide action supervision for the low-level controller πlo. At test time, πhi predicts the next end-effector sub-goal (as a distribution) from workspace observations and language, and πlo executes goal-conditioned control toward that sub-goal. This separation isolates long-horizon reasoning in πhi and maintains precise, embodiment-specific control in πlo.
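The test-time loop described above can be sketched as follows. This is a minimal illustration, not GHOST's actual API: the `env`, `pi_hi`, and `pi_lo` interfaces, the replanning budget, and the distance-based stopping criterion are all hypothetical stand-ins.

```python
import numpy as np

def rollout(pi_hi, pi_lo, env, max_subgoals=10, steps_per_goal=100, tol=0.01):
    """Sketch of the hierarchical test-time loop: pi_hi proposes the next
    3D end-effector sub-goal, and pi_lo runs goal-conditioned control until
    the end-effector reaches it, then pi_hi replans.
    All interfaces here (env.observe, env.step, ...) are hypothetical."""
    obs = env.observe()
    for _ in range(max_subgoals):
        # High-level policy: predict the next end-effector sub-goal
        subgoal = pi_hi(obs)                      # e.g. shape (num_keypoints, 3)
        for _ in range(steps_per_goal):
            action = pi_lo(obs, subgoal)          # goal-conditioned action
            obs = env.step(action)
            ee = obs["ee_keypoints"]              # current end-effector keypoints
            if np.linalg.norm(ee - subgoal) < tol:
                break                             # sub-goal reached; replan
    return obs
```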
1. Data Collection and Processing.
We collect two types of demonstrations: robot teleoperation data and human demonstrations. For robot data, we collect demonstrations across multiple task variants, each demonstrating the same skill (e.g., pick-and-place), to provide in-distribution coverage. Sub-goals are extracted automatically by identifying timesteps where the gripper state changes. For human data, we collect demonstrations of the same skill in novel settings using new objects or novel applications. We track 3D hand poses using off-the-shelf hand pose estimators and manually annotate sub-goal timesteps.
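For the robot data, the automatic sub-goal extraction can be sketched as below: mark the timesteps at which the gripper crosses an open/closed boundary. The 0.5 threshold on a continuous gripper signal is an illustrative assumption.

```python
import numpy as np

def extract_subgoal_timesteps(gripper_states, threshold=0.5):
    """Sketch of automatic sub-goal extraction from teleoperation data:
    return timesteps where the (continuous) gripper state crosses an
    open/closed threshold, i.e. where a grasp or release occurs.
    The 0.5 threshold is an illustrative assumption."""
    binary = np.asarray(gripper_states) > threshold   # open (True) vs closed
    changes = np.flatnonzero(binary[1:] != binary[:-1]) + 1
    return changes.tolist()
```

For human data, these boundaries are instead annotated by hand, as noted above.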

Human demonstration collection. We collect human video demonstrations for out-of-distribution generalization, annotating sub-goal timesteps where key manipulation events occur (e.g., grasping, placing, folding transitions).

To enable cross-embodiment transfer, we represent end-effector poses as a set of four 3D keypoints that can be extracted from both robot grippers and human hands: the gripper base (or palm), the two fingertips, and the grasping center. This unified representation allows the high-level policy to be trained on both robot and human demonstrations.
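A minimal sketch of this four-keypoint layout for a parallel-jaw gripper is below; taking the grasping center as the fingertip midpoint is an illustrative assumption, and the same layout could be filled from estimated MANO hand joints for human data.

```python
import numpy as np

def gripper_keypoints(base, left_tip, right_tip):
    """Sketch of the unified 4-keypoint end-effector representation:
    gripper base (or palm), the two fingertips, and the grasping center
    (taken here as the fingertip midpoint, an illustrative assumption)."""
    base, left_tip, right_tip = map(np.asarray, (base, left_tip, right_tip))
    grasp_center = 0.5 * (left_tip + right_tip)
    return np.stack([base, left_tip, right_tip, grasp_center])  # (4, 3)
```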

ALOHA Gripper

MANO Hand

End-effector representations for robot and human demonstrations. Instead of explicitly retargeting the gripper pose to a translation and SO(3) rotation, we represent gripper poses as sets of 3D keypoints (red spheres): gripper base, fingertip locations, and grasping center. Left: Parallel-jaw gripper. Right: MANO hand.

2. High-Level Goal Prediction Policy.
High-Level Goal Prediction Architecture

GHOST High-level goal prediction architecture: RGB-D observations from multiple cameras are processed with a DINOv3 encoder, with patch tokens augmented by 3D coordinates. Additional context (gripper state, language embedding, embodiment name) is encoded via separate MLPs. An encoder-only transformer processes all tokens, with each patch predicting GMM parameters of a 3D goal distribution over end-effector keypoints.

The high-level policy πhi predicts the end-effector pose at the next sub-goal timestep. We parameterize it as an encoder-only transformer operating on RGB-D observations from multiple cameras. Each RGB-D image is processed independently with a frozen DINOv3 encoder, producing patch tokens that are augmented with the 3D coordinates of their patch centers. Additional tokens encode the current gripper pose, the language instruction embedded with Flan-T5, and a learnable embodiment token specifying whether the demonstration comes from a human or a robot. To handle multimodality in the demonstrations, we model the goal distribution as a dense per-patch Gaussian Mixture Model (GMM), where each patch predicts mixing weights and 3D residual vectors relative to its patch center.

High-level policy GMM predictions at inference. We visualize the mixture components of the GMM, with opacity encoding the mixing weight w_i and arrows showing projections of the 3D residuals δ_i from the patch centers p_i.
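Sampling a goal from this per-patch mixture can be sketched as follows, for a single 3D point: each patch i contributes a component centered at p_i + δ_i with weight w_i. The isotropic noise scale is an illustrative assumption, and in practice one mixture is predicted per end-effector keypoint.

```python
import numpy as np

def sample_goal(patch_centers, weights, residuals, sigma=0.01, rng=None):
    """Sketch of sampling from the dense per-patch GMM over 3D goals:
    patch i contributes a component at patch_centers[i] + residuals[i]
    with mixing weight weights[i]. An isotropic sigma is assumed here
    for illustration; the actual parameterization may differ."""
    rng = np.random.default_rng(rng)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                 # normalize mixing weights
    means = np.asarray(patch_centers) + np.asarray(residuals)
    i = rng.choice(len(weights), p=weights)           # pick a mixture component
    return means[i] + sigma * rng.standard_normal(3)  # sample around its mean
```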

3. Low-Level Goal-Conditioned Policy.
We instantiate πlo as a Diffusion Policy that generates action chunks by denoising, conditioned on observations and spatial goals. Each keypoint of the 3D goal is projected onto the image plane of each camera using the camera parameters, yielding sparse 2D coordinates. These are converted to dense end-effector heatmaps in which each channel encodes the pixel distance field from one keypoint. We choose heatmaps as the goal representation because they provide a smooth, dense spatial signal that aids optimization stability.
End-Effector Heatmap Generation

End-effector heatmap generation for goal conditioning. Predicted 3D keypoints are projected to 2D coordinates, and then converted to dense distance field heatmaps that encode spatial proximity to each keypoint.
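The projection-and-heatmap step can be sketched as below, assuming keypoints are already expressed in the camera frame and using a pinhole intrinsics matrix; normalizing the distance field by the image diagonal is an illustrative choice.

```python
import numpy as np

def keypoint_heatmaps(keypoints_3d, K, height, width):
    """Sketch of goal conditioning via end-effector heatmaps: project each
    3D keypoint (camera frame) to pixels with intrinsics K, then fill one
    channel per keypoint with the pixel distance field to that keypoint.
    Normalization by the image diagonal is an illustrative choice."""
    kps = np.asarray(keypoints_3d, dtype=float)       # (N, 3), camera frame
    uv = (K @ kps.T).T                                # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]                       # divide by depth
    ys, xs = np.mgrid[0:height, 0:width]
    diag = np.hypot(height, width)
    maps = [np.hypot(xs - u, ys - v) / diag for u, v in uv]
    return np.stack(maps)                             # (N, H, W)
```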

Low-Level Goal-Conditioned Policy Architecture

GHOST Low-level goal-conditioned policy architecture. Images from each camera and the projected end-effector heatmap images are processed independently with ResNet encoders and concatenated along with the proprioceptive input into a global conditioning vector for the Diffusion Policy.
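Assuming the per-camera RGB and heatmap features have already been extracted by their encoders, assembling the global conditioning vector amounts to flattening and concatenation; the function below is an illustrative sketch, not the actual implementation.

```python
import numpy as np

def global_conditioning(rgb_feats, heatmap_feats, proprio):
    """Sketch of assembling the Diffusion Policy's global conditioning
    vector: per-camera image features and heatmap features (e.g. from
    ResNet encoders) are flattened and concatenated with proprioception.
    Feature extraction itself is omitted; inputs are assumed precomputed."""
    parts = [np.ravel(f) for f in rgb_feats]          # one feature map per camera
    parts += [np.ravel(f) for f in heatmap_feats]     # one per heatmap image
    parts.append(np.ravel(proprio))                   # proprioceptive state
    return np.concatenate(parts)
```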

Interactive 4D Visualization.
The project page includes an interactive 4D visualization of a GHOST rollout on the onesie folding task, powered by Rerun: synchronized feeds from three workspace cameras alongside a 3D point cloud reconstruction of the scene, all scrubbable over time. (Video downsampled for efficient streaming.)

Experiments and Results.
We evaluate GHOST across multiple challenging real-world manipulation tasks to demonstrate both in-distribution performance improvements and out-of-distribution generalization through human demonstrations.

Task Overview

Task                Data Source   # Demos   Generalization Type
Pick-and-Place
  plate-on-table    Robot          20       —
  plate-in-bin      Robot          20       —
  mug-in-bin        Robot          20       —
  mug-on-table      Human          20       Object combination
Cloth Folding
  fold-onesie       Robot          33       —
  fold-shirt        Robot          50       —
  fold-onesie-ood   Human          17       Object instance
  fold-towel        Human          50       Object category + Skill composition
Hammer Pin
  hammer-pin        Human+Robot   100       —

Pick and Place Results (success rate, %)

Method                      plate-on-table   mug-on-table
DP                                90               10
MimicPlay                         50               30
GHOST (Ours - Robot Only)         90               60
GHOST (Ours)                     100               70

Onesie Folding Results (success rate, %, per completed folding step)

fold-onesie
Method                      1-step   2-step   3-step   4-step   Final
DP                             80       60       50       40       20
MimicPlay                      90       80       80       80       50
GHOST (Ours - Robot Only)     100      100      100       90       90
GHOST (Ours)                  100      100      100       90       90

fold-onesie-ood (Novel Object Instance)
Method                      1-step   2-step   3-step   4-step   Final
DP                             40       40       10        0        0
MimicPlay                      70       40       40       40        0
GHOST (Ours - Robot Only)     100      100       60       50       40
GHOST (Ours)                  100      100       90       70       50

Towel Folding Results (success rate, %)

Method          fold-towel (Novel Category + Skill Composition)
MimicPlay       20
GHOST (Ours)    40

Hammer Pin Results (success rate, %)

Method                      hammer-pin
DP                              20
GHOST (Ours - Robot Only)       60
GHOST (Ours)                    70

Key Findings

Do hierarchical policies improve in-distribution performance even without human data?
Yes. On plate-on-table, nearly all methods saturate, since the task is simple and training data is plentiful. On long-horizon, complex tasks, however, we see significant gains: on fold-onesie, final success increases from 20% (DP) to 90% (GHOST - Robot Only), showing large benefits from hierarchical decomposition. Similarly, on hammer-pin, which requires precisely grasping the hammer and striking the correct pin, performance improves from 20% (DP) to 60% (GHOST - Robot Only).

Do human demonstrations enable transferring learned skills to novel object instances, categories, and contexts?
Yes. Human demonstrations unlock meaningful out-of-distribution transfer of learned skills to novel objects and skill compositions. GHOST achieves 70% success on mug-on-table, a task featuring an object combination unseen in the robot demonstrations. On fold-onesie-ood, GHOST reaches 50% final success, versus 40% for GHOST - Robot Only and 0% for both MimicPlay and flat DP. On the hardest task, generalizing to a novel object category and skill composition (fold-towel), GHOST achieves a 40% success rate compared to 20% for MimicPlay.

BibTeX

Coming soon