How to Peel with a Knife:

Aligning Fine-Grained Manipulation with Human Preference

UC Berkeley

Abstract

Many essential manipulation tasks -- such as food preparation, surgery, and craftsmanship -- remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g., how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories, while maintaining over 90% success rates.

Overview

We use a 7-DoF Kinova Gen3 arm with impedance control. A custom-designed mount holding a knife is attached to the tool end, and two wrist cameras, also attached to the tool end, point toward the knife and the produce. We collect data on three types of produce, train peeling policies that zero-shot generalize to six types of produce with a wide range of geometries and surface physical properties, and finetune the policies to align with human preferences for peel quality.
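Impedance control makes contact with the produce compliant: the commanded force grows with tracking error rather than tracking pose rigidly. As a rough illustration only -- the gains, the Cartesian position-only formulation, and the axis conventions below are our assumptions, not the paper's controller -- the control law can be sketched as:

```python
import numpy as np

def impedance_wrench(x_des, x, xdot_des, xdot, K, D):
    """Cartesian impedance law: commanded force from pose and velocity error.

    x_des, x       -- desired / measured end-effector position, shape (3,)
    xdot_des, xdot -- desired / measured end-effector velocity, shape (3,)
    K, D           -- stiffness and damping matrices, shape (3, 3)
    """
    return K @ (x_des - x) + D @ (xdot_des - xdot)

# Illustrative gains (not the paper's): softer stiffness along z so the
# blade yields to contact instead of digging into the produce.
K = np.diag([400.0, 400.0, 150.0])   # N/m
D = 2.0 * np.sqrt(K)                 # roughly critically damped
f = impedance_wrench(np.array([0.5, 0.0, 0.20]), np.array([0.5, 0.0, 0.21]),
                     np.zeros(3), np.zeros(3), K, D)
# f[2] = 150 * (0.20 - 0.21) = -1.5 N: a gentle restoring force along z.
```

The key design choice is anisotropic stiffness: the controller stays stiff in the peeling direction but compliant along the contact normal, so small depth errors produce small corrective forces rather than gouges.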

Method

We propose a two-stage learning framework for fine-grained, force-sensitive manipulation that combines compliant data collection, force-aware policy learning, and preference-based finetuning within a scalable pipeline for multi-modal data and training. Our approach first integrates visual, proprioceptive, and force-torque sensing to learn a generalizable base diffusion policy. We then introduce a preference-based reward model learned from human feedback to effectively refine the base policy for high-quality performance on real robotic systems.
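The reward model combines quantitative task metrics with human preference feedback. A common way to learn from pairwise human preferences is the Bradley-Terry model; the sketch below is illustrative only (the linear blend, the weights, and the specific metric names are our assumptions, not the paper's formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bradley_terry_loss(r_preferred, r_other):
    """Negative log-likelihood that the preferred trajectory wins under
    the Bradley-Terry preference model; minimized during reward learning."""
    return -np.log(sigmoid(r_preferred - r_other))

def combined_reward(quant_metrics, human_score, w_quant=0.5, w_human=0.5):
    """Illustrative reward blending quantitative task metrics (e.g., a
    normalized peel-coverage measure) with a 1-5 Likert human score
    rescaled to [0, 1]. Weights are hypothetical, not the paper's."""
    return w_quant * np.mean(quant_metrics) + w_human * (human_score - 1) / 4.0

# When the reward model scores two trajectories equally, the preference
# loss is log(2); it shrinks as the preferred trajectory scores higher.
loss_tie = bradley_terry_loss(0.0, 0.0)
```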

Comparison with Base Policy

We finetune the peeling policy by freezing the base diffusion policy and learning a residual policy that predicts action corrections guided by human preference. Our approach achieves over 90% average success rates, with performance improving by up to 40% after finetuning.
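The residual scheme keeps the base policy frozen and trains only a small correction head. A minimal sketch, assuming a generic observation vector and a linear-in-features residual (the actual policies are diffusion models, which the stand-in functions below do not reproduce):

```python
import numpy as np

rng = np.random.default_rng(0)

def base_policy(obs):
    """Stand-in for the frozen base diffusion policy (hypothetical)."""
    return np.tanh(obs[:6])              # nominal 6-DoF action

def residual_policy(obs, theta):
    """Small learned correction; only theta is updated during finetuning."""
    return 0.1 * np.tanh(theta @ obs)    # scale keeps corrections small

def act(obs, theta):
    # Final action = frozen base action + preference-guided residual.
    return base_policy(obs) + residual_policy(obs, theta)

obs = rng.standard_normal(12)
theta = np.zeros((6, 12))   # zero-initialized: finetuning starts at the base policy
a = act(obs, theta)
```

Zero-initializing the residual is the standard trick here: at the start of finetuning the combined policy behaves exactly like the base policy, so preference-driven updates can only move it away from an already-working behavior.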

Robustness to In-Category Variations

We evaluate peeling policies across various produce categories. For produce types encountered during training, we assess the policy's ability to generalize to novel start poses and diverse instances characterized by variations in size, geometry, compliance, and surface texture. The learned policies maintain consistent peeling quality despite these differences.

Cross-Category Produce Generalization

We evaluate zero-shot generalization by deploying policies trained on a single produce category directly onto unseen produce types without further training or supervision. These evaluations subject the system to substantial domain shifts in geometry, surface texture, and mechanical properties. Despite these out-of-distribution challenges, the learned policies exhibit remarkable transferability, maintaining stable contact, adaptive force modulation, and consistent task execution across entirely novel produce classes.

Qualitative Reward Metrics

Qualitative labels are assigned at the trajectory level to capture holistic human judgment of peeling quality. While specific parameters -- such as peel thickness -- can be measured objectively, critical factors like surface smoothness, motion continuity, and the absence of artifacts are inherently difficult to quantify through traditional metrics. To capture these subjective nuances, human annotators evaluate complete peeling episodes based on overall visual quality, assigning a Likert-type ordinal score. This process yields a global, human-aligned supervision signal that bridges the gap between raw sensor data and perceived task quality.
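One standard way to consume such ordinal trajectory scores when training a preference-based reward model is to convert them into pairwise comparisons. The helper below is an illustrative sketch under that assumption, not the paper's exact procedure:

```python
from itertools import combinations

def likert_to_preferences(scores, margin=1):
    """Turn per-trajectory Likert scores (e.g., 1-5) into preference pairs
    (winner_idx, loser_idx). Only pairs whose scores differ by at least
    `margin` are kept, since near-ties carry little preference signal."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] - scores[j] >= margin:
            pairs.append((i, j))
        elif scores[j] - scores[i] >= margin:
            pairs.append((j, i))
    return pairs

# Example: trajectory 2 (score 5) beats trajectories 0 (3) and 1 (2),
# and trajectory 0 beats trajectory 1.
pairs = likert_to_preferences([3, 2, 5])
```

Each resulting pair can then be fed to a pairwise preference loss (e.g., Bradley-Terry) on reward-model outputs, turning coarse ordinal labels into dense comparison supervision.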

Failure Cases

We systematically categorize and analyze characteristic failure modes. Beyond low-level execution errors -- such as the blade engaging too deeply or failing to maintain contact (qualitative scores 1 and 2) -- most failures during cross-category generalization appear to stem from significant domain divergence. For instance, we document the performance degradation when transferring a cucumber-trained policy to apples or a potato-trained policy to cucumbers. While zero-shot transfer across morphologically distinct classes remains a formidable challenge, the boundary of generalizability is influenced by a complex interplay of geometric and mechanical factors, representing a compelling direction for future investigation.

Bibtex

@article{lin2026how,
  author={Lin, Toru and Deng, Shuying and Yin, Zhao-Heng and Abbeel, Pieter and Malik, Jitendra},
  title={How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference},
  journal={arXiv:2603.03280},
  year={2026}
}

Acknowledgements

We thank the authors of the open-source repositories that informed our implementation of the compliant controller on the Kinova Gen3 (tidybot2, compliant_controllers, gen3_compliant_controllers) for their technical guidance. We are grateful to Yifan Hou for initial advice on compliant controller implementation and for sharing resources related to mount design. We also thank Pingchuan Ma for assistance with photography and videography.