Teaching a Robot to Learn from My Hands

January 2026 - Present

In the second semester of my master's in robotics at BU, I looked at my project history and noticed a gap. I'd built drones, designed rocket valves, worked on reinforcement learning for manufacturing. But I'd never actually built anything in robot learning. Not the kind where a robot watches what you do and figures out how to do it itself.

I came across UMI from Cheng Chi's group. They built a system where you hold a gripper, do a task like putting a shirt on a hanger or scooping food, and the robot learns to reproduce it from those demonstrations. No expensive teleoperation arms, no sim-to-real transfer. Just a person doing the task with their hands. I watched their demo videos and kept thinking about it for days.

The approach they use is called diffusion policy. The idea comes from image generation: models like DALL-E and Stable Diffusion work by learning to remove noise. You start with pure static and gradually denoise it into a coherent picture. Diffusion policy does the same thing, but generates sequences of robot actions instead of pixels. Given what the robot currently sees and feels, it denoises a trajectory of future movements. The reason this works well for manipulation is that there are usually multiple valid ways to do a task: there's more than one good way to pick up a cup. Simpler models tend to average between the options and produce a motion that doesn't work at all. Diffusion models can represent that ambiguity and commit to one coherent trajectory.
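To make the denoising idea concrete, here's a toy sketch of the reverse diffusion loop over an action sequence. The network is replaced by a stub (`predict_noise`), and the horizon, action dimension, and noise schedule are illustrative values, not the ones from the paper; the point is just the shape of the loop: start from pure noise, repeatedly subtract the predicted noise, end with a trajectory.

```python
import numpy as np

T = 50                      # number of diffusion steps (illustrative)
H, A = 16, 7                # action horizon and action dimension (illustrative)
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(actions, obs, t):
    # In the real system this is the conditional U-Net; stubbed here
    # so the sampling loop below runs on its own.
    return np.zeros_like(actions)

def sample_actions(obs, rng):
    x = rng.standard_normal((H, A))        # start from pure noise
    for t in reversed(range(T)):
        eps = predict_noise(x, obs, t)
        # standard DDPM update: remove the predicted noise component
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                          # add fresh noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal((H, A))
    return x

actions = sample_actions(obs=None, rng=np.random.default_rng(0))
print(actions.shape)  # (16, 7)
```

With a trained noise predictor in place of the stub, the same loop turns random static into a coherent action trajectory conditioned on the current observation.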

I wanted to understand how the full pipeline works, not just read about it. So I decided to build it from scratch: the data collection hardware, the processing pipeline, and the policy training.

The Glove

The dream for demonstration data is simple: just use your hands and let cameras figure out the rest. No sensors strapped to you, nothing to wear. But reliable hand tracking from vision alone isn't there yet, at least not at the precision you need for training a manipulation policy. So for now, you instrument the hand.

What drew me to UMI was that it doesn't require a second set of robot arms to collect data. You just hold a gripper and do the task. But their system captures the hand as a single number: gripper width. Open or closed, and everything in between. That's fine for tasks where you're just clamping onto things, but humans don't have grippers. We have hands. We pinch, wrap, hook, press with a fingertip. If you want a robot to eventually do anything beyond basic pick-and-place, the demonstration data needs to capture more of what the hand is actually doing. Compressing all of that into one scalar throws away information that matters.

So I built a glove with four magnetic encoders on the finger joints, two stereo cameras, an IMU, and two time-of-flight distance sensors. My observation space is 11D (pose + four independent joint angles) compared to UMI's 8D (pose + gripper width). Whether that extra information actually helps the policy learn better is something I still need to test. But the hypothesis is that richer proprioception should let the policy distinguish between grasp types and learn tasks that a simple gripper can't express.
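For concreteness, here's how I think of the 11-D proprioceptive vector. The exact layout below (3-D position, 4-D unit quaternion, four joint angles) is my illustrative assumption for this sketch, not a documented spec:

```python
import numpy as np

def pack_observation(position, quaternion, joint_angles):
    """Pack one proprioceptive observation into an 11-D vector.

    Assumed layout for illustration: 3-D position, 4-D unit quaternion
    orientation, and four finger-joint angles in radians.
    """
    obs = np.concatenate([position, quaternion, joint_angles])
    assert obs.shape == (11,)
    return obs

obs = pack_observation(
    position=np.array([0.1, 0.0, 0.3]),
    quaternion=np.array([1.0, 0.0, 0.0, 0.0]),
    joint_angles=np.array([0.2, 0.5, 0.4, 0.1]),
)
print(obs.shape)  # (11,)
```

Drop the four joint angles and keep a single width scalar and you're back at the 8-D space UMI uses.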

The hardest part of the hardware was the MCP joint, the knuckle where each finger meets the hand. On a human hand, there's flesh between the index finger and the other three fingers. You can't bridge a structural element across that gap. I spent a lot of time in my head designing elaborate solutions, thinking about bearings and complex assemblies to get stability. And then the final design that actually works is incredibly simple: the encoder mount cantilevers from one side, the axis of rotation has a slight offset to clear the neighboring finger, and that's it. No bearings, easy to assemble. I could have gotten there a lot faster if I'd just started printing earlier instead of trying to solve it in my head. The final version looks obvious in hindsight, but that's because the complexity went into the iteration, not the design.

Seeing in 3D

The first version of this glove used an iPhone for video and offline SLAM for pose reconstruction. It worked, technically. But monocular SLAM can't recover metric scale. My trajectories had significant scale variation between sessions. If the model thinks a movement is ten centimeters but it's actually fifteen, you get a policy that consistently overshoots or undershoots. That's not a tuning problem, it's a data problem, and no amount of training fixes it.

So I rebuilt the sensing around stereo vision. Two global shutter cameras with a known baseline, plus the IMU, fed into OKVIS2-X, a stereo visual-inertial odometry system that produces metric-scale pose estimates. Getting the cameras and IMU calibrated was its own project. I used Kalibr for extrinsic calibration, which involves waving the glove in front of an AprilTag grid while it estimates the spatial and temporal offsets between every sensor. The result is consistent metric-scale tracking across sessions, which is what the policy actually needs.
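The reason a known baseline fixes the scale problem comes down to one formula: with calibrated stereo, depth is z = f·b/d, where f is the focal length in pixels, b the baseline in meters, and d the disparity in pixels. A minimal sketch (the numbers are made up for illustration, not my glove's actual calibration):

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Metric depth from stereo disparity: z = f * b / d.

    The known baseline is exactly what monocular SLAM lacks; with it,
    disparity maps directly to metric depth.
    """
    return focal_px * baseline_m / disparity_px

# e.g. 450 px focal length, 6 cm baseline, 30 px disparity -> 0.9 m
print(stereo_depth(450.0, 0.06, 30.0))  # 0.9
```

Because b is fixed by the hardware, every session recovers the same scale, which is what removes the ten-versus-fifteen-centimeter ambiguity.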

The Pipeline

Recording a demonstration means putting on the glove, pressing record in a dashboard I built, doing the task, and pressing stop. The system captures stereo frames, IMU data, encoder readings, and distance measurements, all timestamped in EuRoC format.
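EuRoC-style storage boils down to per-sensor CSV files keyed by nanosecond timestamps. Here's a minimal sketch of writing an IMU stream that way; the column names approximate the EuRoC convention and the sample values are fabricated:

```python
import csv
import io

def write_imu_csv(samples, fileobj):
    """Write IMU samples EuRoC-style: timestamp [ns], gyro xyz, accel xyz."""
    w = csv.writer(fileobj)
    w.writerow(["#timestamp [ns]",
                "w_x [rad s^-1]", "w_y [rad s^-1]", "w_z [rad s^-1]",
                "a_x [m s^-2]", "a_y [m s^-2]", "a_z [m s^-2]"])
    for t_ns, gyro, accel in samples:
        w.writerow([t_ns, *gyro, *accel])

buf = io.StringIO()
write_imu_csv([(1_000_000, (0.01, 0.0, -0.02), (0.0, 0.0, 9.81))], buf)
print(buf.getvalue().splitlines()[0])  # the header row
```

Camera frames get the same treatment: a data.csv mapping each nanosecond timestamp to an image filename, so every stream shares one time axis.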

After recording, I run the data through OKVIS2-X to get pose trajectories, transform them into a consistent reference frame, and package everything into a training dataset. The policy is a conditional denoising diffusion model based on Chi et al.'s architecture: dual ResNet-18 encoders for the stereo camera feeds, concatenated with the proprioceptive state, conditioning a 1D U-Net that denoises action trajectories.
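The conditioning side of that architecture is simple to picture: each ResNet-18 encoder pools a frame down to a 512-D feature vector (512 being ResNet-18's final feature width), and those get concatenated with the proprioceptive state. A sketch with random stand-ins for the encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two ResNet-18 encoders: each maps one stereo frame
# to a 512-D pooled feature vector.
feat_left = rng.standard_normal(512)
feat_right = rng.standard_normal(512)
proprio = rng.standard_normal(11)   # pose + four joint angles

# The conditioning vector handed to the 1-D U-Net at each denoising step.
cond = np.concatenate([feat_left, feat_right, proprio])
print(cond.shape)  # (1035,)
```

The U-Net then denoises the action trajectory conditioned on this vector, exactly the role `predict_noise` plays in the sampling loop sketched earlier.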

There's a gap between reading a methods section in a paper and having a working system. Synchronizing sensor streams across different clock domains, getting coordinate frame transforms correct between the calibration output and what the VIO system expects, handling format conversions. None of it is conceptually hard, but it's where most of the actual development time goes.
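As one example of that glue work, matching samples across streams once a clock offset is known is a nearest-neighbor search with a tolerance. A sketch (the offset and tolerance values are illustrative):

```python
import numpy as np

def nearest_match(ref_ts, other_ts, offset_ns=0, tol_ns=5_000_000):
    """For each reference timestamp, find the nearest sample in another
    stream after removing a known clock offset. Returns the matched
    index, or -1 where nothing lies within the tolerance."""
    shifted = other_ts + offset_ns
    idx = np.searchsorted(shifted, ref_ts)
    idx = np.clip(idx, 1, len(shifted) - 1)
    left = idx - 1
    # pick whichever neighbor is closer in time
    pick = np.where(np.abs(shifted[idx] - ref_ts) < np.abs(shifted[left] - ref_ts),
                    idx, left)
    return np.where(np.abs(shifted[pick] - ref_ts) <= tol_ns, pick, -1)

cam = np.array([0, 33_000_000, 66_000_000])          # ~30 fps frame times
imu = np.array([1_000_000, 32_500_000, 65_900_000])  # nearby IMU samples
print(nearest_match(cam, imu))  # [0 1 2]
```

The estimated `offset_ns` is exactly the kind of number a calibration tool like Kalibr hands back; the tedious part is making sure its sign convention matches the one the rest of the pipeline assumes.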

What's Next

I'm currently collecting demonstrations and working on the first training runs with real data. On the hardware side, I'm designing a custom PCB, a Pi Zero 2W HAT that replaces the current breadboard setup with direct sensor reads, onboard battery, and a USB hub for the cameras. Once that board arrives, the glove becomes fully untethered.

The code is open source: github.com/corneliusgruss/diffusion-policy-from-scratch