Imitation learning for robot manipulation

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

Without changing the model architecture, SkiP reshapes the training targets so one policy learns to skip predictable motion and reserve precise control for contact-rich segments.

Mingtong Dai (1,2,6), Guanqi Peng (2,3), Yongjie Bai (2,4), Feng Yan (5), Chunjie Chen (1), Lingbo Liu* (2), Liang Lin (2,4), Xinyu Wu (1)
(1) SIAT, (2) PCL, (3) SUSTech, (4) SYSU, (5) UNT, (6) UCAS
SkiP teaser showing skip and refine behavior across a robot manipulation trajectory
Abstract

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction.

SkiP introduces an action relabeling mechanism: at each timestep in a skip segment, the behavior cloning target is replaced with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. Motion Spectrum Keying (MSK) partitions demonstrations automatically from action signals. Across 72 simulated manipulation tasks and three real-robot tasks, SkiP reduces executed steps by 15-40% while matching or improving success rates across policy backbones.

15-40% fewer executed steps while preserving success
0.85 RLBench-10 SR, up from 0.70 with CoA-rev
72.9 RLBench-10 steps, down from 127.5 with CoA-rev
+14.9 pp average SR gain over CoA-rev on RLBench-10
Core idea

Uniform timestep prediction is the bottleneck

Robot demonstrations are not equally informative at every timestep. A long free-space reach is mostly predictable, but the grasp, contact, and alignment phases need dense correction. Standard behavior cloning treats both kinds of phase the same, so it spends decisions where little changes and compounds prediction errors along the way.

SkiP fixes this at the supervision level. In a skip segment, the target action is relabeled to the entrance of the next key segment; inside a key segment, the target remains the immediate next action. The result is a single policy with two behaviors: large, purposeful jumps when motion is redundant, and small refinements when precision matters.

What changes? The behavior cloning target, not the model architecture.
What does it learn? A skip mode for low-information motion and a refine mode near contacts.
Why does it help? Fewer policy calls mean less error accumulation and faster execution.
Method

Detect key moments, then relabel the target actions

Motion Spectrum Keying (MSK) finds high-information segments directly from the action signal. SkiP then changes only the supervision target: smooth free-space portions point to the next key moment, while contact-rich portions keep dense next-step control.

Overview of SkiP: spectral analysis identifies key segments, then action targets are relabeled to produce skip and refine behavior.
01 Analyze motion spectrum

DCT energy highlights locally complex movement, while bend cues capture geometric changes that velocity alone can miss.

02 Keep key segments dense

Contacts, grasps, and alignment windows stay as refinement regions; predictable transit is treated as skippable.

03 Rewrite imitation targets

The same policy backbone learns two output modes: long jumps through free space and short corrections near precise manipulation.
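Steps 01 and 02 can be pictured as thresholding a per-timestep complexity score and padding each detection so the surrounding window stays dense. The sketch below is illustrative only: the quantile threshold, the `margin` padding, and the function names are assumptions, not the paper's MSK procedure.

```python
import numpy as np

def key_mask_from_score(score, quantile=0.8, margin=2):
    """Mark timesteps as key where a complexity score exceeds a quantile.

    Hypothetical rule: the quantile threshold and the margin are assumptions.
    """
    score = np.asarray(score, dtype=float)
    raw = score > np.quantile(score, quantile)
    mask = raw.copy()
    # Pad each detection by `margin` steps on both sides.
    for shift in range(1, margin + 1):
        mask[shift:] |= raw[:-shift]
        mask[:-shift] |= raw[shift:]
    return mask

def segments(mask):
    """Return runs of equal labels as (start, end_exclusive, is_key) tuples."""
    runs = []
    start = 0
    for t in range(1, len(mask) + 1):
        if t == len(mask) or mask[t] != mask[start]:
            runs.append((start, t, bool(mask[start])))
            start = t
    return runs

score = [0, 0, 0, 0, 0, 5, 0, 0, 0, 0]
print(segments(key_mask_from_score(score, margin=1)))
# [(0, 4, False), (4, 7, True), (7, 10, False)]
```

With a single spike at t = 5 and a margin of one step, timesteps 4 to 6 form the key segment and the runs on either side are treated as skippable.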

Trajectory
\[ \tau = \{(o_t, a_t)\}_{t=1}^{T}, \qquad a_t \in \mathbb{R}^{d} \]

Demonstrations are action sequences with non-uniform information density.

Motion Spectrum Keying
\[ c_{t,k} = \sum_{n=0}^{W-1} \alpha_k v_{t,n} \cos\!\left(\frac{\pi(2n+1)k}{2W}\right) \]

Short-window DCT coefficients expose local motion complexity.
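A minimal NumPy sketch of the coefficient formula above, assuming that \(v_{t,n}\) is the \(n\)-th sample of the length-\(W\) window starting at timestep \(t\) of a 1-D motion signal and that \(\alpha_k\) is the orthonormal DCT-II scaling. The `high_freq_energy` score is a hypothetical way to summarize local complexity, not necessarily the score MSK uses.

```python
import numpy as np

def window_dct(v, W):
    """DCT-II coefficients c[t, k] for every length-W window of a 1-D signal."""
    v = np.asarray(v, dtype=float)
    n = np.arange(W)
    k = np.arange(W)
    # Orthonormal scaling (an assumed choice for alpha_k).
    alpha = np.full(W, np.sqrt(2.0 / W))
    alpha[0] = np.sqrt(1.0 / W)
    # basis[k, n] = alpha_k * cos(pi * (2n + 1) * k / (2W))
    basis = alpha[:, None] * np.cos(np.pi * (2 * n[None, :] + 1) * k[:, None] / (2 * W))
    # windows[t, n] = v[t + n], shape (T - W + 1, W)
    windows = np.lib.stride_tricks.sliding_window_view(v, W)
    return windows @ basis.T

def high_freq_energy(c, k_min=1):
    """Energy in coefficients k >= k_min (hypothetical complexity score)."""
    return np.sum(c[:, k_min:] ** 2, axis=1)

# A step change in the signal: flat, then a jump, then flat again.
v = np.concatenate([np.zeros(8), np.ones(8)])
energy = high_freq_energy(window_dct(v, W=4))
print(np.round(energy, 3))
```

For a constant window only the \(k = 0\) coefficient is non-zero, so this score stays at zero during steady motion and rises only in windows that contain the change.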

Action relabeling
\[ \tilde a_t = \begin{cases} a_{\operatorname{nextKey}(t)}, & t \in \mathcal{S}_{skip} \\ a_{t+1}, & t \in \mathcal{S}_{key} \end{cases} \]

Skip segments learn long jumps; key segments keep dense next-step control.
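The relabeling rule can be sketched in a few lines, assuming the key/skip partition is already available as a boolean mask. The function name and the two boundary choices (reusing the final action when there is no next step or no later key segment) are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def relabel_targets(actions, is_key):
    """Relabel behavior cloning targets for one demonstration.

    actions: array of shape (T, d); is_key: boolean sequence of length T.
    Key steps keep the next action; skip steps point to the action at the
    entrance of the next key segment.
    """
    actions = np.asarray(actions)
    is_key = np.asarray(is_key, dtype=bool)
    T = len(actions)
    targets = np.empty_like(actions)
    next_key = None  # smallest key index seen so far while scanning backward
    for t in range(T - 1, -1, -1):
        if is_key[t]:
            # Assumption: the final step reuses the last action as its target.
            targets[t] = actions[min(t + 1, T - 1)]
            next_key = t
        elif next_key is not None:
            targets[t] = actions[next_key]
        else:
            # Assumption: a trailing skip segment targets the final action.
            targets[t] = actions[T - 1]
    return targets

actions = np.arange(6, dtype=float).reshape(6, 1)
is_key = [False, False, True, True, False, False]
print(relabel_targets(actions, is_key).ravel())
# [2. 2. 3. 4. 5. 5.]
```

In this toy trajectory the two leading skip steps both target the action at index 2, the entrance of the key segment, while the key steps at indices 2 and 3 keep their next-step targets.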

Results

The improvement is a speed-accuracy trade-off, not just speed

SkiP often improves success because skipping removes intermediate predictions that would otherwise accumulate error. The strongest signal is that it cuts steps while moving toward the high-success, low-step region of the comparison plots.

Main simulated benchmark: 0.85 SR

RLBench-10 average success rate, compared with 0.43 for DP and 0.70 for CoA-rev.

Execution efficiency: 72.9 steps

RLBench-10 executed steps, down from 160.0 for DP and 127.5 for CoA-rev.

Cross-backbone transfer: 0.772 SR

RoboMimic average SR using a Diffusion Policy UNet backbone, above CoA-rev's 0.723.

What to read from the figures: SkiP is not only a faster rollout trick. The relabeling changes the learned action distribution, producing large jumps in skip mode and small corrections in refine mode.
RLBench-10 aggregate results
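| Method | SR ↑ | Steps ↓ | Steps_succ | Rank ↓ |
|---|---|---|---|---|
| DP | 0.43 ± .02 | 160.0 ± 3 | 119.1 ± 2 | 5.4 |
| KF-only | 0.49 ± .01 | 113.1 ± 3 | 10.9 ± 3 | 4.3 |
| CoA-fwd | 0.68 ± .01 | 161.5 ± 1 | 133.3 ± 1 | 3.5 |
| ACT | 0.71 ± .01 | 139.4 ± 1 | 106.1 ± 1 | 3.1 |
| CoA-rev | 0.70 ± .05 | 127.5 ± 3 | 86.2 ± 3 | 3.1 |
| SkiP† | 0.30 ± .01 | 183.1 ± 2 | 97.6 ± 7 | 6.3 |
| SkiP | 0.85 ± .01 | 72.9 ± 2 | 43.4 ± 1 | 1.6 |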
RoboMimic transfer results
| Task | CoA-rev | CoA-fwd | SkiP† | SkiP |
|---|---|---|---|---|
| lift | 0.960 ± 0.016 | 1.000 ± 0.000 | 0.013 ± 0.019 | 1.000 ± 0.000 |
| can | 0.880 ± 0.016 | 0.873 ± 0.034 | 0.233 ± 0.009 | 0.827 ± 0.019 |
| square | 0.420 ± 0.016 | 0.327 ± 0.025 | 0.247 ± 0.019 | 0.673 ± 0.062 |
| transport | 0.633 ± 0.057 | 0.427 ± 0.047 | 0.000 ± 0.000 | 0.587 ± 0.041 |
| Average SR | 0.723 ± 0.013 | 0.657 ± 0.013 | 0.123 ± 0.002 | 0.772 ± 0.012 |
Action relabeling diagram for SkiP skip and key segments
Action relabeling turns dense next-step supervision into skip-segment targets.
Success-rate versus step-count trade-off across RLBench tasks
SkiP moves toward the high-success, low-step region of the trade-off plot.
Per-task success curves on RLBench-50
Per-task success rates on RLBench-50 show smoother degradation for SkiP across long-horizon tasks.
RLBench Rollouts

SkiP and baseline side by side in simulation

Representative RLBench evaluation videos compare the same task under SkiP and a baseline policy. The selected tasks cover drawer opening, box opening, and object grasping behaviors.

Videos (SkiP and baseline, side by side): open drawer, open box, pick up cup.
Real Robot

On hardware, SkiP spends fewer decisions on transit and more near manipulation

With a fine-tuned vision-language-action backbone, SkiP improves real tabletop rollouts by using the same mechanism: jump through easy motion, then slow down around the manipulation phase.

46.7% pour-water SR, up from 40.0%
53.3% stack-bowls SR, up from 33.3%
2:00 tidy-up-desk wall-clock time, down from 2:20

The key observation is practical: free-space motion does not need the same prediction density as contact-rich phases. SkiP allocates decisions where the task needs them.

Real-robot fine-tuning
| Task | Base SR (%) | SkiP SR (%) | Base steps | SkiP steps | Base time (min:sec) | SkiP time (min:sec) |
|---|---|---|---|---|---|---|
| pour-water | 40.0 | 46.7 | 290.4 | 265.4 | 3:44 | 3:28 |
| stack-bowls | 33.3 | 53.3 | 246.3 | 204.5 | 2:41 | 2:16 |
| tidy-up-desk | 66.7 | 73.3 | 250.7 | 207.4 | 2:20 | 2:00 |
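The relative savings implied by the real-robot table can be checked with a short calculation. The numbers below are copied from the table; the helper names are only for illustration.

```python
def to_seconds(mmss):
    """Convert a 'min:sec' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def relative_reduction(base, new):
    """Fraction by which `new` is smaller than `base`."""
    return (base - new) / base

# (base steps, SkiP steps, base time, SkiP time), copied from the table.
rows = {
    "pour-water": (290.4, 265.4, "3:44", "3:28"),
    "stack-bowls": (246.3, 204.5, "2:41", "2:16"),
    "tidy-up-desk": (250.7, 207.4, "2:20", "2:00"),
}

for task, (base_steps, skip_steps, base_time, skip_time) in rows.items():
    step_cut = relative_reduction(base_steps, skip_steps)
    time_cut = relative_reduction(to_seconds(base_time), to_seconds(skip_time))
    print(f"{task}: steps -{100 * step_cut:.1f}%, time -{100 * time_cut:.1f}%")
```

This gives step reductions of roughly 8.6%, 17.0%, and 17.3%, and wall-clock reductions of roughly 7.1%, 15.5%, and 14.3%, for pour-water, stack-bowls, and tidy-up-desk respectively.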
Real-robot rollout examples for tidy-up-desk, pour-water, and stack-the-bowls.
Analysis

The learned policy separates skip and refine actions

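The ablations show that the segmentation signal matters. On average, MSK beats the random-stride, velocity-only, and low-velocity alternatives (0.851 vs. 0.520, 0.545, and 0.773 average success on RLBench-10), although individual tasks can favor a simpler heuristic. The displacement analysis reveals the intended bimodal behavior.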

Skip/refine evidence
\[ d_{\mathrm{jump}} = \lVert a_1 - p_{\mathrm{ee}} \rVert_2, \qquad \mathcal{A} = \mathcal{A}_{\mathrm{skip}} \cup \mathcal{A}_{\mathrm{key}} \]

During evaluation, SkiP calls are separated by displacement. Skip-mode calls produce large jumps, while key-mode calls stay near zero for local refinement.
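A sketch of the displacement analysis, assuming \(a_1\) is the positional part of the first action returned by a policy call and \(p_{\mathrm{ee}}\) is the current end-effector position in the same frame. The fixed `threshold` used to split calls into the two groups is a hypothetical choice; the source only states that calls are separated by displacement.

```python
import numpy as np

def jump_displacement(first_actions, ee_positions):
    """d_jump = ||a_1 - p_ee||_2 for each policy call (rows are calls)."""
    diff = np.asarray(first_actions, dtype=float) - np.asarray(ee_positions, dtype=float)
    return np.linalg.norm(diff, axis=-1)

def split_calls(d_jump, threshold):
    """Indices of skip-mode calls (large jumps) and key-mode calls (small jumps)."""
    d_jump = np.asarray(d_jump)
    skip_idx = np.flatnonzero(d_jump > threshold)
    key_idx = np.flatnonzero(d_jump <= threshold)
    return skip_idx, key_idx

# Two toy calls: one large jump and one local correction (positions in meters, assumed).
first_actions = [[0.30, 0.00, 0.00], [0.01, 0.00, 0.00]]
ee_positions = [[0.00, 0.00, 0.00], [0.00, 0.00, 0.00]]
d = jump_displacement(first_actions, ee_positions)
skip_idx, key_idx = split_calls(d, threshold=0.05)
print(d, skip_idx, key_idx)
```

A histogram of `d` over many calls is what would expose the two modes: a cluster near zero for refinement and a separate cluster of large jumps.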

MSK is the useful supervision signal: it captures motion curvature and segment placement, instead of relying only on stride or raw velocity.
The output becomes bimodal: the policy learns when to leap through free space and when to slow down near contact-rich parts.
Label source ablation on RLBench-10
RS (random stride) marks fixed periodic key segments, matching MSK's approximate key-step ratio.
VO (velocity-only) treats high-velocity timesteps as key, which can miss slow contact-rich manipulation.
LV (low-velocity) treats low-velocity timesteps as key, a stronger heuristic for careful manipulation.
MSK uses motion-spectrum energy with bend and keyframe cues to place key segments.
| Task | RS | VO | LV | MSK (SkiP) |
|---|---|---|---|---|
| open-box | .76 ± .00 | .03 ± .02 | .76 ± .00 | .91 ± .02 |
| open-drawer | .72 ± .00 | .04 ± .00 | .96 ± .00 | 1.0 ± .00 |
| pick-up-cup | .23 ± .05 | .75 ± .04 | .81 ± .08 | .76 ± .06 |
| press-switch | .41 ± .05 | .79 ± .04 | .57 ± .05 | .53 ± .04 |
| push-button | .12 ± .00 | .65 ± .02 | 1.0 ± .00 | .99 ± .02 |
| reach-target | .64 ± .00 | .68 ± .00 | .31 ± .02 | .68 ± .00 |
| stack-wine | .77 ± .02 | .85 ± .02 | .88 ± .00 | 1.0 ± .00 |
| sweep-dustpan | .64 ± .00 | .04 ± .00 | 1.0 ± .00 | 1.0 ± .00 |
| take-lid-off | .60 ± .00 | .99 ± .02 | .71 ± .02 | .97 ± .02 |
| turn-tap | .31 ± .04 | .64 ± .03 | .73 ± .02 | .67 ± .02 |
| Average | .520 | .545 | .773 | .851 |
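To make the three baseline label sources concrete, the sketch below shows one way each could be written. It is a guess at their general shape, not the paper's implementation: computing speed from finite differences of the action sequence, thresholding by quantile, and the `period` parameter are all assumptions.

```python
import numpy as np

def speed(actions):
    """Per-timestep speed from finite differences of the action sequence (assumed)."""
    actions = np.asarray(actions, dtype=float)
    diffs = np.diff(actions, axis=0, prepend=actions[:1])
    return np.linalg.norm(diffs, axis=1)

def labels_vo(actions, key_ratio):
    """VO: the fastest timesteps are marked key."""
    s = speed(actions)
    return s >= np.quantile(s, 1.0 - key_ratio)

def labels_lv(actions, key_ratio):
    """LV: the slowest timesteps are marked key."""
    s = speed(actions)
    return s <= np.quantile(s, key_ratio)

def labels_rs(T, key_ratio, period=10):
    """RS: fixed periodic key segments covering about key_ratio of the steps."""
    key_len = max(1, round(period * key_ratio))
    return (np.arange(T) % period) < key_len

# Toy 1-D trajectory: move for three steps, then hold still.
actions = np.array([[0.0], [1.0], [2.0], [3.0], [3.0], [3.0], [3.0], [3.0]])
print(labels_vo(actions, key_ratio=0.375))  # key while moving
print(labels_lv(actions, key_ratio=0.5))    # key while (nearly) stationary
print(labels_rs(len(actions), key_ratio=0.25, period=4))
```

On this toy trajectory VO marks the transit as key and LV marks the stationary phase, which is consistent with the caption's point that a high-velocity rule can miss slow, careful manipulation.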
Action displacement distributions across RLBench tasks
Action displacement distributions show SkiP's two modes: large skip jumps and near-zero local refinement.
Citation

BibTeX

@misc{dai2026skip,
  title         = {SkiP: When to Skip and When to Refine for Efficient Robot Manipulation},
  author        = {Dai, Mingtong and Peng, Guanqi and Bai, Yongjie and Yan, Feng and Chen, Chunjie and Liu, Lingbo and Lin, Liang and Wu, Xinyu},
  year          = {2026},
  eprint        = {2605.15536},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  doi           = {10.48550/arXiv.2605.15536},
  url           = {https://arxiv.org/abs/2605.15536}
}