Cornelius Gruss

A Worn Skill-Capture System

A solo, four-month reimplementation of a worn skill-capture glove — hardware, SLAM, and policy, end-to-end.

V3 glove. Custom DAQ board on the back of the wrist, four MLX90393 Hall sensors at the joints, two VL53L1X time-of-flight sensors, and on-glove OV9782 stereo cameras (a relic from an earlier build — the production pose pipeline uses an externally-mounted GoPro).

This is a glove that records what your hands do, so a robot can learn to do it later. There's a custom DAQ board on the wrist with four magnetic sensors at the finger joints, two time-of-flight sensors at the fingertips, and a GoPro mounted to the back of the hand for camera pose. You put it on, press record, pour rice or stack a mug, and the recording becomes a 23-dimensional training example aligned to the video frame-by-frame. The form factor borrows from Sunday Robotics' published Skill Capture Glove; the SLAM pipeline is forked from UMI; the policy is ACT. Four months ago I'd never designed a circuit board.

I built this as the term project for ME740 (Vision, Robotics, and Planning) at Boston University. The course wanted a single deliverable; this writeup is the longer story.

One demonstration of mug-on-coaster (~15 s). 182 of these became the training set.

Why this project

I'd been reading UMI for a while. It's a handheld gripper with a wrist-mounted GoPro that produces training data for diffusion-policy imitation learning — no teleoperation rig, no sim-to-real, just a person doing the task with the gripper in their hand. The paper's framing is portable, low-cost, and information-rich data collection. The video kept replaying in my head. That's the kind of robotics I want to work on.

What stood out was how much has to be right at once. Hardware, SLAM, alignment, policy. You can read about each layer separately and still not understand what makes the whole thing work. I figured the way to learn it was to build all of it.

Sunday Robotics emerged from stealth at the end of 2025 with a glove built for the same task — finger joint sensing instead of UMI's single-DoF gripper. The form factor was public; the internals weren't. So I started from the public photos and the paper, scoped the project at the full stack, and built a version of my own.

What I built

The capture pipeline is five stages. A person wears the glove, presses record on the GoPro, performs the task, and stops. Two recorders run independently — the GoPro writes 60 fps video to its own SD card; a Python recorder on the laptop streams joint-sensor and time-of-flight data at 30 Hz over USB into a per-day master tape. Both clocks share UTC, set by pointing the camera at GoPro Labs' precision-time QR page once per session. After capture, an ingest stage slices the master tape into per-demo segments, the SLAM stage produces a per-frame 6-DoF pose, and an alignment stage interpolates the proprioceptive streams onto the video timeline. The output is one Parquet file per demonstration, with an identical schema for every demo recorded in the project.
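A minimal sketch of the alignment step, assuming the proprioceptive table carries a UTC-seconds timestamp plus one numeric column per sensor channel (the column names here are illustrative, not the project's actual schema):

```python
import numpy as np
import pandas as pd

def align_proprio_to_frames(proprio: pd.DataFrame,
                            frame_times_utc: np.ndarray) -> pd.DataFrame:
    """Resample 30 Hz proprioception onto the GoPro's 60 fps frame timeline.

    `proprio` is assumed to have a UTC-seconds column `t_utc` plus one numeric
    column per sensor channel; the names are illustrative, not the real schema.
    """
    t = proprio["t_utc"].to_numpy()
    out = {"t_utc": frame_times_utc}
    for col in proprio.columns.drop("t_utc"):
        # Linear interpolation onto each video frame timestamp; samples outside
        # the recorded span are clamped to the first/last value by np.interp.
        out[col] = np.interp(frame_times_utc, t, proprio[col].to_numpy())
    return pd.DataFrame(out)
```

Linear interpolation is enough when going from 30 Hz up to a 60 fps frame timeline; the shared UTC clock is what makes the timestamps comparable in the first place.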

The five-stage capture pipeline. The SLAM stage changed eight times during the project; the inter-stage schema stayed constant.

The hardware is two parts. The DAQ board sits on the back of the wrist and runs an STM32G431 — bit-bang I²C against a TCA9548A multiplexer, four MLX90393 Hall sensors on thin FR4 PCBs at the index PIP, index MCP, three-finger PIP, and three-finger MCP joints, two VL53L1X time-of-flight sensors, and a QMI8658C IMU. Output is a single USB CDC stream at two packet rates: proprioception at 30 Hz (interpolated against the GoPro's 60 fps frame timeline at align time) and IMU at ~1 kHz nominal. The fingers are 3D-printed PLA with a deliberate axial slit at each PIP joint — about 0.5 mm of designed-in compliance that lets fingertip load translate the joint magnet axially toward the sensor. That is how the glove gets force without a load cell.
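For illustration, a hypothetical laptop-side parser for the 30 Hz proprioception packet; the real firmware's framing, field order, and sizes aren't documented here, so this layout is an assumption:

```python
import struct

# Hypothetical framing for the 30 Hz proprioception packet; the real firmware's
# field order and sizes may differ. Little-endian: sync byte, packet-type byte,
# uint32 millisecond timestamp, 4 Hall sensors x 3 int16 axes (Bx, By, Bz),
# 2 uint16 ToF ranges in mm, 1 checksum byte.
PROPRIO_FMT = "<BBI" + "hhh" * 4 + "HH" + "B"
PROPRIO_SIZE = struct.calcsize(PROPRIO_FMT)

def parse_proprio_packet(buf: bytes) -> dict:
    """Unpack one proprioception packet from the USB CDC stream."""
    fields = struct.unpack(PROPRIO_FMT, buf[:PROPRIO_SIZE])
    t_ms = fields[2]
    hall = fields[3:15]                                    # raw LSB counts
    return {
        "t_ms": t_ms,
        "hall": [hall[i:i + 3] for i in range(0, 12, 3)],  # per-joint (Bx, By, Bz)
        "tof_mm": fields[15:17],                           # two VL53L1X ranges
    }
```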

The DAQ board. STM32G431 + TCA9548A I²C mux + QMI8658C IMU. Four MLX90393 Hall sensor PCBs and two VL53L1X ToF sensors connect via JST headers around the perimeter. Designed in KiCad, manufactured at JLCPCB.

The policy is ACT — Action Chunking with Transformers, the architecture from ALOHA. I picked ACT over diffusion policy partly because 182 demos is a small dataset and ACT's deterministic chunked prediction is more sample-efficient than per-step diffusion, and partly because the mug-on-coaster task routinely has the coaster out of frame mid-pickup, and a 5-second action horizon lets the policy plan past those gaps even though its observation horizon is only two frames. The model is a CVAE: the encoder infers a latent over action styles, and the decoder is a 7-layer DETR-style transformer over a frozen DINOv2 ViT-S/14 backbone. 38.7M parameters, 16.6M trainable. The whole training stack runs on the BU Shared Computing Cluster on a single L40S.
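The training objective is the standard ACT one: a masked L1 reconstruction loss over the predicted action chunk plus a KL term weighted by kl_weight. A minimal sketch (the reference implementation's KL normalization differs slightly):

```python
import torch
import torch.nn.functional as F

def act_loss(pred_actions, target_actions, is_pad, mu, logvar, kl_weight=50.0):
    """Masked L1 over the action chunk plus a weighted KL term (CVAE).

    pred_actions / target_actions: (B, chunk_len, action_dim)
    is_pad:                        (B, chunk_len) bool, True past the demo's end
    mu, logvar:                    CVAE latent parameters from the encoder
    """
    l1 = F.l1_loss(pred_actions, target_actions, reduction="none")
    l1 = (l1 * ~is_pad.unsqueeze(-1)).mean()          # is_pad masking
    # KL(q(z|...) || N(0, I)); the reference implementation sums over the
    # latent dimension rather than averaging, so the scale here differs.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return l1 + kl_weight * kl
```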

Eight SLAM backends

The SLAM stage changed eight times in four months. That was not the plan, and the plan wouldn't have been better than what actually happened.

The reason it kept moving is that I was training a policy on absolute poses. If the trajectories aren't metrically consistent across sessions, the model is fitting label noise and no training-side fix gets it back. So I needed a SLAM stack that produced consistent, metric trajectories on the actual demonstration data — not on a benchmark, not on someone else's video. Each backend was a different attempt at that.

I started with DROID-SLAM because it is the dense visual-only SLAM most likely to "just work" on monocular video. It crashed on Colab — lietorch CUDA timeout, GPU SIFT failures, the cloud-only stack didn't survive a real session. Production stages should not depend on uncached cloud-GPU dependencies.

I switched to COLMAP SfM, ran it offline against 17 demo sessions, and saw a 25.4% cross-session scale variation in the position RMSE — a metric-scale problem that no policy training can recover from. COLMAP without inertial fusion can't give a consistent metric across sessions.

Next was custom stereo SIFT triangulation — 1.2 m → 1.6 m on the ruler test (33% scale error). Better than monocular, still not metric. Triangulation alone is insufficient; you need full visual-inertial fusion.

That sent me to OKVIS2-X, a production-grade stereo VIO. Calibrated with Kalibr (0.81 px reprojection), worked on the bare boards. After I assembled the gloves with the fingers on, OKVIS2 started failing — finger occlusion ate the stereo overlap. The 81° HFOV cameras had only 36° clean overlap once the fingers were in the field of view, and overlap collapsed to negative numbers when fingers approached the centerline. I tried CIL237 fisheye lenses (122° HFOV, 77° overlap) and OKVIS2 still struggled. The geometry was not a tuning problem; the cameras were on the wrong rigid body.

Three sentences hide a lot of grinding. Each Kalibr session is a waving routine in front of an AprilGrid. My house had AprilGrid sheets taped to monitors, walls, and bookshelves for two months. The first ten calibrations were getting Kalibr to converge; the next ten were chasing reprojection numbers I didn't trust. By the end of the OKVIS2 era I knew the config file by heart and didn't believe any of it.

I tried custom GTSAM v1, v2, v3 — three iterations of in-house factor graphs with CoTracker keypoints and IMU pre-integration. v2 converged on one hand-picked session at 0.5% scale error. It didn't converge on the larger dataset: landmarks were frozen at their first observation and reused across sessions without re-optimization. The fix would have been a bundle adjustment pass over the whole map. I never built it.

Six weeks of OV9782 work, three custom factor graphs, one Hero 5 detour, and a running total of calibrations I'd thrown out — and I finally accepted that the on-glove cameras weren't the thing to keep fixing. I pivoted to GoPro Hero 10 + UMI's published pipeline (ORB-SLAM3 + Kalibr Docker image, no recalibration needed). First mapping run: 99.6% tracking on an 83-second clip. First demo localized against that map: 99.5%, with the demo start landing 1.2 cm from where the mapping trajectory ended.

The camera was the bottleneck, not the algorithm. I moved the OV9782 code into an _archive/ folder and didn't look back.

ORB-SLAM3 hit a ceiling I couldn't tune past. The community reproduction rate for UMI's ORB-SLAM3 stage hovers around 20%, the chicheng/orb_slam3 Docker image is closed-source, and every 2024–2026 UMI successor paper (FastUMI, UMI-3D, ActiveUMI) replaces the SLAM stage. So I did the same.

The current production stack is HLoc (Hierarchical-Localization): SuperPoint keypoints, LightGlue matching, NetVLAD retrieval, COLMAP for the map model — plus a pose-graph batch smoother and a final Gaussian σ=5 filter. Same two-stage architecture as ORB-SLAM3 (build a map once, localize each demo frame against it), but learned features with retrieval are materially more robust on hard scenes. On three demos that ORB-SLAM3 had failed or railed on, HLoc localized 100% of frames on all three.
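The final filtering step is simple to sketch: a Gaussian filter with σ = 5 (assumed to be in frames) over the localized positions. The pose-graph batch smoother runs before this, and rotations need quaternion-aware handling; both are omitted in this sketch:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_positions(positions: np.ndarray, sigma_frames: float = 5.0) -> np.ndarray:
    """Gaussian-smooth an (N, 3) position trajectory along the time axis.

    Sketch of the final filtering step only; the pose-graph batch smoother and
    a rotation-aware filter would run alongside this in the full pipeline.
    """
    return gaussian_filter1d(positions, sigma=sigma_frames, axis=0, mode="nearest")
```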

The pattern across the eight backends was consistent: each one failed for a structural reason, not a tuning reason, and the failure mode pointed at the next stack. I kept the downstream interface stable through all of it — every backend wrote the same camera_trajectory.csv schema, so the alignment and training stages didn't move while the SLAM stage churned underneath them.
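For illustration, a loader that pins down that shared interface. is_lost is the one column name quoted in this writeup; the rest are assumptions about what a stable per-frame pose schema would carry:

```python
import pandas as pd

# Assumed column set for the shared camera_trajectory.csv interface; only
# is_lost is quoted in the writeup, the other names are illustrative.
TRAJECTORY_COLUMNS = [
    "frame_idx", "t_utc",        # video frame index and UTC timestamp
    "x", "y", "z",               # camera position in the map frame (m)
    "qx", "qy", "qz", "qw",      # camera orientation quaternion
    "is_lost",                   # tracking/localization failure flag
]

def load_trajectory(path: str) -> pd.DataFrame:
    """Load one backend's camera_trajectory.csv and check the shared columns."""
    df = pd.read_csv(path)
    missing = set(TRAJECTORY_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"camera_trajectory.csv missing columns: {missing}")
    return df
```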

What works

Three pieces of evidence I trust. None of them is "policy succeeds at task on a real robot" — that is Phase 2 — but each one is necessary for the eventual robot to have a chance.

SLAM versus a Kalibr AprilGrid. I built a controlled validation harness using a 6×6 AprilGrid with 25.8 mm tags as the truth source. Six clips total — one for the SfM map, five for validation — covering static no-cup, static cup-in-hand, and slow board-circle scenarios. On static no-cup frames the production stack hits 8.3 mm p95 error against ground truth; the AprilGrid's own physical noise floor on those frames is 8.1 mm p95. In this validation regime, the residual error sits at the AprilGrid noise floor — further gains on the same regimes would have to come from capture-side hardware (higher-resolution sensor, denser workspace texture, more SfM coverage of demo viewpoints), not the SLAM stack itself. Fast pickup-and-place motion isn't separately bounded against ground truth; that's flagged in the limits below.
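The headline number is a 95th-percentile per-frame Euclidean position error. A minimal sketch, assuming the SLAM and AprilGrid-derived trajectories are already time-aligned and expressed in a common frame:

```python
import numpy as np

def p95_position_error(slam_xyz: np.ndarray, truth_xyz: np.ndarray) -> float:
    """95th-percentile per-frame Euclidean error between two (N, 3) trajectories."""
    err = np.linalg.norm(slam_xyz - truth_xyz, axis=1)
    return float(np.percentile(err, 95))

# With positions in metres, 0.0083 corresponds to the 8.3 mm p95 figure above.
```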

Force-correlated Hall signal on compliant joints. Each PIP joint has the deliberate axial slit so fingertip load translates the joint magnet toward the sensor; the magnetometer's Bz channel reads the gap change. I ran a controlled scale-press experiment — 20 ramped presses plus 20 random-angle presses per assembly, 300–1300 g range, on both a compliant-linkage assembly (3F PIP) and a rigid-linkage assembly (Index MCP) as a negative control. The 3F PIP joint gives a per-press linear fit of +1.29 counts/g, R² = 0.64. The Index MCP is flat under load, which is the expected failure of a linkage too rigid to translate the magnet axially. The correlation is real on the joints with the slit-compliance design — narrow scope, but a real signal. Extending it to MCP needs the V4 air-gap revision.
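The per-press fit is an ordinary least-squares line of Hall Bz counts against applied load, with R² reported alongside the slope. A minimal sketch of that analysis, assuming paired (load, Bz) samples within a single press:

```python
import numpy as np

def press_fit(load_g: np.ndarray, hall_bz_counts: np.ndarray):
    """Least-squares line of Hall Bz counts vs. applied load for one press.

    Returns (slope in counts/g, R^2).
    """
    slope, intercept = np.polyfit(load_g, hall_bz_counts, 1)
    pred = slope * load_g + intercept
    ss_res = float(np.sum((hall_bz_counts - pred) ** 2))
    ss_tot = float(np.sum((hall_bz_counts - hall_bz_counts.mean()) ** 2))
    return float(slope), 1.0 - ss_res / ss_tot
```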

The PIP joint, opened up. About 0.5 mm of axial slit gives the joint its compliance — fingertip load translates the magnet toward the Hall sensor underneath, and Bz reads the gap change.

ACT policy on 182 demonstrations. Trained on the BU SCC: four visually distinct workspaces, kl_weight 50, 5-second action horizon, image augmentation, is_pad masking, frozen DINOv2 backbone, 150-epoch budget with the best checkpoint at epoch 38. The production checkpoint reaches 28.8 mm position L1, 6.31° rotation L1, 736 LSB Hall L1 on a 27-session held-out validation set, scored against the HLoc + σ=5 ground truth. The position error is roughly 3.5× the truth-source's own measurement floor — the model is the dominant residual, not label noise.

The production policy predicting the next 5-second action chunk on a held-out demo. Predicted Hall traces (right panel) track the squeeze cycle through the grasp event.

The training set is what survived a label-quality audit, plus a top-up the day after. The audit's trajectory-smoothness gate excluded 40 of the original 172 candidate sessions — 23% — for unphysical single-frame jumps that turned out to be silent SLAM pose-graph re-anchoring artifacts. I recorded 50 additional clean demonstrations the following day, bringing the final training set to 182. Tracking percentage doesn't catch the audit-flagged jumps; the SLAM tracker reports is_lost = False on the offending frames. A scrubbable review dashboard, shipped alongside the pipeline, is what surfaced them frame by frame.
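The smoothness gate itself is easy to sketch: a re-anchoring artifact shows up as a single-frame displacement far beyond plausible hand speed. The threshold below is an illustrative assumption, not the audit's actual value:

```python
import numpy as np

def flag_pose_jumps(positions_m: np.ndarray, fps: float = 60.0,
                    max_speed_m_s: float = 3.0) -> np.ndarray:
    """Return a per-frame bool mask flagging unphysical single-frame jumps.

    positions_m: (N, 3) camera positions in metres, one row per video frame.
    The 3 m/s speed threshold is an illustrative assumption, not the audit's.
    """
    step = np.linalg.norm(np.diff(positions_m, axis=0), axis=1)  # metres per frame
    jump = step * fps > max_speed_m_s                            # implied-speed check
    return np.concatenate([[False], jump])                       # first frame can't jump
```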

The review dashboard. Per-demo trajectory-quality metrics on a scrubbable timeline alongside the GoPro video. This is the tool that caught the 23% mislabeled fraction.

What's missing

The honest version of the limits, in order of how much they bite.

  • No actuated arm. This is offline imitation learning trained on human demonstrations. The metric is L1 against trajectory ground truth, not task-completion rate on a real robot. Deployment on a real arm — UR-class or similar, since the actuated version of the glove is heavier than a hobby arm can carry — would convert the offline metric into a behavior, and might expose modeling gaps the held-out validation didn't.
  • Force sensing is PIP-only. The compliant-linkage joints work; the rigid-linkage MCP joints need a V4 air-gap revision before they can carry force. The claim is correctly narrowed in the paper.
  • Single task, single operator, single map per scene. All 182 demonstrations were recorded by a single operator on one task across four workspaces; cross-task and cross-wearer generalization weren't measured. HLoc localizes against a per-environment SfM map, so a new scene needs a fresh mapping clip and a rebuilt atlas.
  • AprilGrid bound is on static and slow-motion frames. The 8.3 mm p95 SLAM-versus-ground-truth figure was measured on static-no-cup, static-cup-in-hand, and slow board-circle scenarios. Fast pickup-and-place motions weren't separately bounded against ground truth.
  • Fingertip geometry isn't optimized for fine manipulation. I tried a handful of mug-on-coaster-style tasks during development; some worked clean and some were stubborn — the difference came down to how the rounded fingertip pads contacted small or thin parts. Different pad geometries (sharper edges for fine pinches, softer compliance for delicate parts, maybe interchangeable inserts) are the obvious next mechanical experiment.

Built with

UMI (Cheng Chi et al., RSS 2024) — the Hero 10 + Docker stack the production pose pipeline forked from. Hierarchical-Localization, SuperPoint, LightGlue, NetVLAD, COLMAP — production SLAM stack. ACT (Tony Zhao et al., RSS 2023) — policy architecture. DINOv2 — frozen ViT-S/14 visual backbone. GoPro Labs precision-time QR — millisecond camera-to-laptop sync. Final project for ME740 (Vision, Robotics, and Planning, Boston University, Spring 2026), under Prof. John Baillieul.
