How to Peel with a Knife:

Aligning Fine-Grained Manipulation with Human Preference

UC Berkeley

Abstract

Many essential manipulation tasks -- such as food preparation, surgery, and craftsmanship -- remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g., how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories, while maintaining over 90% success rates.

Overview

We use a 7-DoF Kinova Gen3 arm with impedance control. A custom-designed mount holding a knife is attached to the tool end, and two wrist cameras, also attached to the tool end, point toward the knife and the produce. We collect data on three types of produce, train peeling policies that zero-shot generalize to six types of produce with a wide range of geometries and surface physical properties, and finetune the policies to align with human preferences for peel quality.
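Impedance control makes contact with the produce compliant: the commanded force grows with tracking error rather than tracking pose rigidly. As a rough illustration only -- the gains, the Cartesian position-only formulation, and the axis conventions below are our assumptions, not the paper's controller -- the control law can be sketched as:

```python
import numpy as np

def impedance_wrench(x_des, x, xdot_des, xdot, K, D):
    """Cartesian impedance law: commanded force from pose and velocity error.

    x_des, x       -- desired / measured end-effector position, shape (3,)
    xdot_des, xdot -- desired / measured end-effector velocity, shape (3,)
    K, D           -- stiffness and damping matrices, shape (3, 3)
    """
    return K @ (x_des - x) + D @ (xdot_des - xdot)

# Illustrative gains (not the paper's): softer stiffness along z so the
# blade yields to contact instead of digging into the produce.
K = np.diag([400.0, 400.0, 150.0])   # N/m
D = 2.0 * np.sqrt(K)                 # roughly critically damped
f = impedance_wrench(np.array([0.5, 0.0, 0.20]), np.array([0.5, 0.0, 0.21]),
                     np.zeros(3), np.zeros(3), K, D)
# f[2] = 150 * (0.20 - 0.21) = -1.5 N: a gentle restoring force along z.
```

The key design choice is anisotropic stiffness: the controller stays stiff in the peeling direction but compliant along the contact normal, so small depth errors produce small corrective forces rather than gouges.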

Method

We propose a two-stage learning framework for fine-grained, force-sensitive manipulation that combines compliant data collection, force-aware policy learning, and preference-based finetuning within a scalable pipeline for multi-modal data and training. Our approach first integrates visual, proprioceptive, and force-torque sensing to learn a generalizable base diffusion policy. We then introduce a preference-based reward model learned from human feedback to effectively refine the base policy for high-quality performance on real robotic systems.
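The reward model combines quantitative task metrics with human preference feedback. A common way to learn from pairwise human preferences is the Bradley-Terry model; the sketch below is illustrative only (the linear blend, the weights, and the specific metric names are our assumptions, not the paper's formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bradley_terry_loss(r_preferred, r_other):
    """Negative log-likelihood that the preferred trajectory wins under
    the Bradley-Terry preference model; minimized during reward learning."""
    return -np.log(sigmoid(r_preferred - r_other))

def combined_reward(quant_metrics, human_score, w_quant=0.5, w_human=0.5):
    """Illustrative reward blending quantitative task metrics (e.g., a
    normalized peel-coverage measure) with a 1-5 Likert human score
    rescaled to [0, 1]. Weights are hypothetical, not the paper's."""
    return w_quant * np.mean(quant_metrics) + w_human * (human_score - 1) / 4.0

# When the reward model scores two trajectories equally, the preference
# loss is log(2); it shrinks as the preferred trajectory scores higher.
loss_tie = bradley_terry_loss(0.0, 0.0)
```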

Comparison with Base Policy

We finetune the peeling policy by freezing the base diffusion policy and learning a residual policy that predicts action corrections guided by human preference. Our approach achieves over 90% average success rates, with performance improving by up to 40% after finetuning.
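The residual scheme keeps the base policy frozen and trains only a small correction head. A minimal sketch, assuming a generic observation vector and a linear-in-features residual (the actual policies are diffusion models, which the stand-in functions below do not reproduce):

```python
import numpy as np

rng = np.random.default_rng(0)

def base_policy(obs):
    """Stand-in for the frozen base diffusion policy (hypothetical)."""
    return np.tanh(obs[:6])              # nominal 6-DoF action

def residual_policy(obs, theta):
    """Small learned correction; only theta is updated during finetuning."""
    return 0.1 * np.tanh(theta @ obs)    # scale keeps corrections small

def act(obs, theta):
    # Final action = frozen base action + preference-guided residual.
    return base_policy(obs) + residual_policy(obs, theta)

obs = rng.standard_normal(12)
theta = np.zeros((6, 12))   # zero-initialized: finetuning starts at the base policy
a = act(obs, theta)
```

Zero-initializing the residual is the standard trick here: at the start of finetuning the combined policy behaves exactly like the base policy, so preference-driven updates can only move it away from an already-working behavior.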

Robustness to In-Category Variations

We evaluate peeling policies across various produce categories. For produce types encountered during training, we assess the policy's ability to generalize to novel start poses and diverse instances characterized by variations in size, geometry, compliance, and surface texture. The learned policies maintain consistent peeling quality despite these differences.

Cross-Category Produce Generalization

We evaluate zero-shot generalization by deploying policies trained on a single produce category directly onto unseen produce types without further training or supervision. These evaluations subject the system to substantial domain shifts in geometry, surface texture, and mechanical properties. Despite these out-of-distribution challenges, the learned policies exhibit remarkable transferability, maintaining stable contact, adaptive force modulation, and consistent task execution across entirely novel produce classes.

Qualitative Reward Metrics

Qualitative labels are assigned at the trajectory level to capture holistic human judgment of peeling quality. While specific parameters -- such as peel thickness -- can be measured objectively, critical factors like surface smoothness, motion continuity, and the absence of artifacts are inherently difficult to quantify through traditional metrics. To capture these subjective nuances, human annotators evaluate complete peeling episodes based on overall visual quality, assigning a Likert-type ordinal score. This process yields a global, human-aligned supervision signal that bridges the gap between raw sensor data and perceived task quality.
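One standard way to consume such ordinal trajectory scores when training a preference-based reward model is to convert them into pairwise comparisons. The helper below is an illustrative sketch under that assumption, not the paper's exact procedure:

```python
from itertools import combinations

def likert_to_preferences(scores, margin=1):
    """Turn per-trajectory Likert scores (e.g., 1-5) into preference pairs
    (winner_idx, loser_idx). Only pairs whose scores differ by at least
    `margin` are kept, since near-ties carry little preference signal."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] - scores[j] >= margin:
            pairs.append((i, j))
        elif scores[j] - scores[i] >= margin:
            pairs.append((j, i))
    return pairs

# Example: trajectory 2 (score 5) beats trajectories 0 (3) and 1 (2),
# and trajectory 0 beats trajectory 1.
pairs = likert_to_preferences([3, 2, 5])
```

Each resulting pair can then be fed to a pairwise preference loss (e.g., Bradley-Terry) on reward-model outputs, turning coarse ordinal labels into dense comparison supervision.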

Failure Cases

We systematically categorize and analyze characteristic failure modes. Beyond low-level execution errors -- such as the blade engaging too deeply or failing to maintain contact (qualitative scores 1 and 2) -- most failures during cross-category generalization appear to stem from significant domain divergence. For instance, we document the performance degradation when transferring a cucumber-trained policy to apples or a potato-trained policy to cucumbers. While zero-shot transfer across morphologically distinct classes remains a formidable challenge, the boundary of generalizability is influenced by a complex interplay of geometric and mechanical factors, representing a compelling direction for future investigation.

Bibtex

@article{lin2026how,
  author={Lin, Toru and Deng, Shuying and Yin, Zhao-Heng and Abbeel, Pieter and Malik, Jitendra},
  title={How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference},
  journal={arXiv:2603.03280},
  year={2026}
}

Acknowledgements

We thank the authors of the open-source repositories that informed our implementation of the compliant controller on the Kinova Gen3 (tidybot2, compliant_controllers, gen3_compliant_controllers) for their technical guidance. We are grateful to Yifan Hou for initial advice on compliant controller implementation and for sharing resources related to mount design. We also thank Pingchuan Ma for assistance with photography and videography.