Imitation learning for robot manipulation
SkiP: When to Skip and When to Refine for Efficient Robot Manipulation
Without changing the model architecture, SkiP reshapes the training targets so one policy learns to skip predictable motion and reserve precise control for contact-rich segments.
Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction.
SkiP introduces an action relabeling mechanism: at each timestep in a skip segment, the behavior cloning target is replaced with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. Motion Spectrum Keying (MSK) partitions demonstrations automatically from action signals. Across 72 simulated manipulation tasks and three real-robot tasks, SkiP reduces executed steps by 15-40% while matching or improving success rates across policy backbones.
Uniform timestep prediction is the bottleneck
Robot demonstrations are not equally informative at every timestep. A long free-space reach is mostly predictable, but the grasp, contact, and alignment phases need dense correction. Standard behavior cloning treats both kinds of phase the same, so it spends decisions where little changes and compounds prediction errors along the way.
SkiP fixes this at the supervision level. In a skip segment, the target action is relabeled to the entrance of the next key segment; inside a key segment, the target remains the immediate next action. The result is a single policy with two behaviors: large, purposeful jumps when motion is redundant, and small refinements when precision matters.
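The relabeling rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `relabel_targets`, the boolean `is_key` mask, the convention that `actions[t]` is the action to execute from the observation at step `t`, and the fallback to the final action when no key segment follows are all assumptions made for the example.

```python
import numpy as np

def relabel_targets(actions: np.ndarray, is_key: np.ndarray) -> np.ndarray:
    """Build behavior-cloning targets for one demonstration.

    actions: (T, D) demonstrated actions (assumed pose-style, so that a
             later action is a meaningful target to jump to).
    is_key:  (T,) boolean mask, True inside key segments.
    """
    T = len(actions)
    targets = np.empty_like(actions)
    # Assumption: if no key segment follows, aim at the final action.
    next_key = T - 1
    # Scan backwards so that next_key always holds the first key index
    # after the current timestep, i.e. the entrance of the next key segment.
    for t in range(T - 1, -1, -1):
        if is_key[t]:
            # Key segment: keep dense supervision on the immediate action.
            targets[t] = actions[t]
            next_key = t
        else:
            # Skip segment: point at the entrance of the next key segment.
            targets[t] = actions[next_key]
    return targets

# Toy demonstration: 6 steps, one key segment covering steps 2 and 3.
demo_actions = np.arange(6, dtype=float).reshape(6, 1)
demo_mask = np.array([False, False, True, True, False, False])
demo_targets = relabel_targets(demo_actions, demo_mask)
```

In the toy demonstration, steps 0 and 1 both receive the action at step 2 as their target, which is what lets a single decision leap over the transit. Only the targets change; the observations and the policy architecture are untouched.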
Detect key moments, then relabel the target actions
Motion Spectrum Keying (MSK) finds high-information segments directly from the action signal. SkiP then changes only the supervision target: smooth free-space portions point to the next key moment, while contact-rich portions keep dense next-step control.
DCT energy highlights locally complex movement, while bend cues capture geometric changes that velocity alone can miss.
Contacts, grasps, and alignment windows stay as refinement regions; predictable transit is treated as skippable.
The same policy backbone learns two output modes: long jumps through free space and short corrections near precise manipulation.
Demonstrations are action sequences with non-uniform information density.
Short-window DCT coefficients expose local motion complexity.
Skip segments learn long jumps; key segments keep dense next-step control.
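To make the two MSK cues concrete, here is a rough sketch of a short-window DCT energy score and a bend (turning-angle) score. It is an assumption-laden illustration rather than the published procedure: the page does not give the window length, the thresholds, or how the cues are combined, so `window`, `energy_quantile`, `bend_threshold`, and the logical OR in `key_mask` are hypothetical choices. The input is assumed to be a per-step delta (velocity-like) action signal, so that constant-velocity transit falls entirely into the DC coefficient.

```python
import numpy as np

def dct_energy(deltas: np.ndarray, window: int = 8) -> np.ndarray:
    """Energy in the non-DC DCT-II coefficients of a sliding window."""
    T = deltas.shape[0]
    n = np.arange(window)
    # DCT-II basis: basis[k, n] = cos(pi * (n + 0.5) * k / window)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / window)
    pad = window // 2
    padded = np.pad(deltas, ((pad, window - pad - 1), (0, 0)), mode="edge")
    energy = np.zeros(T)
    for t in range(T):
        coeffs = basis @ padded[t:t + window]  # shape (window, D)
        energy[t] = np.sum(coeffs[1:] ** 2)    # drop the DC term
    return energy

def bend_cue(deltas: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Turning angle (radians) between consecutive motion directions."""
    a, b = deltas[:-1], deltas[1:]
    na, nb = np.linalg.norm(a, axis=1), np.linalg.norm(b, axis=1)
    valid = (na > eps) & (nb > eps)  # stationary steps count as no bend
    cos = np.ones(len(a))
    cos[valid] = np.sum(a * b, axis=1)[valid] / (na[valid] * nb[valid])
    angle = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.concatenate([[0.0], angle])

def key_mask(deltas, window=8, energy_quantile=0.8, bend_threshold=np.pi / 6):
    """Hypothetical combination: a step is key if either cue fires."""
    energy = dct_energy(deltas, window)
    bend = bend_cue(deltas)
    return (energy > np.quantile(energy, energy_quantile)) | (bend > bend_threshold)

# Toy signal: straight transit, a short zig-zag, then straight transit again.
transit_a = np.tile([1.0, 0.0], (20, 1))
zigzag = np.array([[0.2, 0.2 if i % 2 == 0 else -0.2] for i in range(10)])
transit_b = np.tile([0.0, 1.0], (20, 1))
toy_deltas = np.vstack([transit_a, zigzag, transit_b])
toy_mask = key_mask(toy_deltas)
```

On the toy signal, the straight transit on either side is left unflagged, while the zig-zag in the middle is marked as key. The bend cue reacts to the direction changes even though the speed within the zig-zag is constant, which is the kind of geometric change a velocity-only signal can miss.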
The improvement is a speed-accuracy trade-off, not just speed
SkiP often improves success because skipping removes intermediate predictions that would otherwise accumulate error. Rather than trading accuracy for speed, it cuts executed steps and raises success at the same time, landing in the high-success, low-step region of the comparison plots.
0.85: SkiP's RLBench-10 average success rate, compared with 0.43 for DP and 0.70 for CoA-rev.
72.9: SkiP's RLBench-10 executed steps, down from 160.0 for DP and 127.5 for CoA-rev.
0.772: SkiP's RoboMimic average SR using a Diffusion Policy UNet backbone, above CoA-rev's 0.723.
| Method | SR ↑ | Steps ↓ | Steps_succ ↓ | Rank ↓ |
|---|---|---|---|---|
| DP | 0.43 ± .02 | 160.0 ± 3 | 119.1 ± 2 | 5.4 |
| KF-only | 0.49 ± .01 | 113.1 ± 3 | 10.9 ± 3 | 4.3 |
| CoA-fwd | 0.68 ± .01 | 161.5 ± 1 | 133.3 ± 1 | 3.5 |
| ACT | 0.71 ± .01 | 139.4 ± 1 | 106.1 ± 1 | 3.1 |
| CoA-rev | 0.70 ± .05 | 127.5 ± 3 | 86.2 ± 3 | 3.1 |
| SkiP† | 0.30 ± .01 | 183.1 ± 2 | 97.6 ± 7 | 6.3 |
| SkiP | 0.85 ± .01 | 72.9 ± 2 | 43.4 ± 1 | 1.6 |
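As a worked check of the trade-off, the relative step savings can be computed directly from the table above. The snippet copies the SR and Steps means from the table; the helper name `percent_fewer_steps` is made up for this example, and the comparison ignores the reported standard deviations.

```python
# (success rate, executed steps) means copied from the table above.
rlbench = {
    "DP": (0.43, 160.0),
    "ACT": (0.71, 139.4),
    "CoA-rev": (0.70, 127.5),
    "SkiP": (0.85, 72.9),
}

def percent_fewer_steps(baseline_steps: float, new_steps: float) -> float:
    """Relative reduction in executed steps, in percent."""
    return 100.0 * (baseline_steps - new_steps) / baseline_steps

skip_sr, skip_steps = rlbench["SkiP"]
comparison = {
    name: {
        "sr_gain": round(skip_sr - sr, 2),
        "fewer_steps_pct": round(percent_fewer_steps(steps, skip_steps), 1),
    }
    for name, (sr, steps) in rlbench.items()
    if name != "SkiP"
}

for name, row in comparison.items():
    print(f"vs {name}: +{row['sr_gain']:.2f} SR, {row['fewer_steps_pct']}% fewer steps")
```

Against every baseline listed here the success-rate gain and the step saving point the same way, which is why the result reads as a joint improvement rather than a speed-only one.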
| Task | CoA-rev | CoA-fwd | SkiP† | SkiP |
|---|---|---|---|---|
| lift | 0.960 ± 0.016 | 1.000 ± 0.000 | 0.013 ± 0.019 | 1.000 ± 0.000 |
| can | 0.880 ± 0.016 | 0.873 ± 0.034 | 0.233 ± 0.009 | 0.827 ± 0.019 |
| square | 0.420 ± 0.016 | 0.327 ± 0.025 | 0.247 ± 0.019 | 0.673 ± 0.062 |
| transport | 0.633 ± 0.057 | 0.427 ± 0.047 | 0.000 ± 0.000 | 0.587 ± 0.041 |
| Average SR | 0.723 ± 0.013 | 0.657 ± 0.013 | 0.123 ± 0.002 | 0.772 ± 0.012 |
SkiP and baseline side by side in simulation
Representative RLBench evaluation videos compare the same task under SkiP and a baseline policy. The selected tasks cover drawer opening, box opening, and object grasping behaviors.
Videos: Open drawer, Open box, Pick up cup.
On hardware, SkiP spends fewer decisions on transit and more near manipulation
With a fine-tuned vision-language-action backbone, SkiP improves real tabletop rollouts by using the same mechanism: jump through easy motion, then slow down around the manipulation phase.
The key observation is practical: free-space motion does not need the same prediction density as contact-rich phases. SkiP allocates decisions where the task needs them.
| Task | Base SR (%) | SkiP SR (%) | Base steps | SkiP steps | Base time (min:sec) | SkiP time (min:sec) |
|---|---|---|---|---|---|---|
| pour-water | 40.0 | 46.7 | 290.4 | 265.4 | 3:44 | 3:28 |
| stack-bowls | 33.3 | 53.3 | 246.3 | 204.5 | 2:41 | 2:16 |
| tidy-up-desk | 66.7 | 73.3 | 250.7 | 207.4 | 2:20 | 2:00 |
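The hardware savings in the table above are easier to compare as percentages. The sketch below converts the min:sec columns to seconds and computes relative savings in steps and wall-clock time; the values are copied from the table, and the helper names `to_seconds` and `percent_saved` are illustrative only.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'min:sec' string such as '3:44' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def percent_saved(base: float, new: float) -> float:
    """Relative saving of `new` over `base`, in percent."""
    return 100.0 * (base - new) / base

# task: (base steps, SkiP steps, base time, SkiP time), from the table above.
real_robot = {
    "pour-water": (290.4, 265.4, "3:44", "3:28"),
    "stack-bowls": (246.3, 204.5, "2:41", "2:16"),
    "tidy-up-desk": (250.7, 207.4, "2:20", "2:00"),
}

savings = {
    task: (
        percent_saved(base_steps, skip_steps),
        percent_saved(to_seconds(base_time), to_seconds(skip_time)),
    )
    for task, (base_steps, skip_steps, base_time, skip_time) in real_robot.items()
}

for task, (step_pct, time_pct) in savings.items():
    print(f"{task}: {step_pct:.1f}% fewer steps, {time_pct:.1f}% less time")
```

The step and time savings track each other closely on all three tasks, consistent with the saved decisions coming out of transit rather than out of the manipulation phase.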
The learned policy separates skip and refine actions
The ablations show that the segmentation signal matters. Averaged over the ten tasks, MSK (.851) beats random stride (RS, .520), velocity-only (VO, .545), and low-velocity (LV, .773) segmentation, although individual tasks such as press-switch and turn-tap favor a simpler signal. The displacement analysis reveals the intended bimodal behavior.
During evaluation, policy calls separate cleanly by displacement: skip-mode calls produce large jumps, while key-mode calls stay near zero for local refinement.
| Task | RS | VO | LV | MSK (SkiP) |
|---|---|---|---|---|
| open-box | .76 ± .00 | .03 ± .02 | .76 ± .00 | .91 ± .02 |
| open-drawer | .72 ± .00 | .04 ± .00 | .96 ± .00 | 1.0 ± .00 |
| pick-up-cup | .23 ± .05 | .75 ± .04 | .81 ± .08 | .76 ± .06 |
| press-switch | .41 ± .05 | .79 ± .04 | .57 ± .05 | .53 ± .04 |
| push-button | .12 ± .00 | .65 ± .02 | 1.0 ± .00 | .99 ± .02 |
| reach-target | .64 ± .00 | .68 ± .00 | .31 ± .02 | .68 ± .00 |
| stack-wine | .77 ± .02 | .85 ± .02 | .88 ± .00 | 1.0 ± .00 |
| sweep-dustpan | .64 ± .00 | .04 ± .00 | 1.0 ± .00 | 1.0 ± .00 |
| take-lid-off | .60 ± .00 | .99 ± .02 | .71 ± .02 | .97 ± .02 |
| turn-tap | .31 ± .04 | .64 ± .03 | .73 ± .02 | .67 ± .02 |
| Average | .520 | .545 | .773 | .851 |
BibTeX
@misc{dai2026skip,
title = {SkiP: When to Skip and When to Refine for Efficient Robot Manipulation},
author = {Dai, Mingtong and Peng, Guanqi and Bai, Yongjie and Yan, Feng and Chen, Chunjie and Liu, Lingbo and Lin, Liang and Wu, Xinyu},
year = {2026},
eprint = {2605.15536},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
doi = {10.48550/arXiv.2605.15536},
url = {https://arxiv.org/abs/2605.15536}
}