Transformer-Based ASL Fingerspelling to Text
Sign language users communicate through hand shapes and movements, but most digital systems cannot interpret these gestures. This system automatically translates American Sign Language fingerspelling, where each hand shape represents a letter, into written English text. It first tracks the 3D positions of the hands and upper-body joints in video using MediaPipe, then feeds these landmark sequences into a Transformer encoder-decoder that recognizes individual letters and detects word boundaries. The model reaches a validation character error rate of ≈ 0.36, a meaningful step toward automated sign language accessibility.
Problem & users
Fingerspelling conveys names and terms without dedicated signs. Recognition is difficult due to fast transitions and co-articulation between letters. The goal is letter-accurate transcription with clear handling of similar handshapes and consistent preprocessing across signers and cameras.
Solution overview
The system replaces raw pixels with landmarks and frames the task as sequence-to-sequence prediction. MediaPipe provides 3D keypoints per frame. A Transformer captures temporal context and outputs letters A–Z plus an end token. The pipeline includes careful feature selection and normalization for stable learning.
Data & feature spec
Dataset
Google American Sign Language Fingerspelling dataset with precomputed MediaPipe landmarks. Examples are landmark sequences paired with target words.
Feature selection (F = 144 per frame)
- Pose joints: six upper body joints (11, 12, 13, 14, 15, 16)
- Left hand: 21 landmarks (0–20)
- Right hand: 21 landmarks (0–20)
- Ordering per frame: pose → left hand → right hand. Each landmark: x, y, z → 48 triplets × 3 = 144 values
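The per-frame feature selection above can be sketched as follows. This is a minimal illustration, not the project's code; the assumed raw layout (33 pose landmarks followed by 21 left-hand and 21 right-hand landmarks, each with x, y, z) matches MediaPipe's standard outputs.

```python
import numpy as np

# Assumed raw MediaPipe layout: 33 pose landmarks, then 21 left-hand
# and 21 right-hand landmarks, each as an (x, y, z) triplet.
POSE_JOINTS = [11, 12, 13, 14, 15, 16]  # shoulders, elbows, wrists
N_POSE, N_HAND = 33, 21

def select_features(frame: np.ndarray) -> np.ndarray:
    """frame: (75, 3) raw landmarks -> (144,) feature vector."""
    pose = frame[POSE_JOINTS]                    # (6, 3)
    left = frame[N_POSE : N_POSE + N_HAND]       # (21, 3)
    right = frame[N_POSE + N_HAND :]             # (21, 3)
    # Ordering per frame: pose -> left hand -> right hand
    return np.concatenate([pose, left, right]).reshape(-1)  # 48 * 3 = 144

frame = np.random.rand(75, 3).astype(np.float32)
features = select_features(frame)
print(features.shape)  # (144,)
```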
Normalization & serialization
- Center: mid-shoulders if pose present, else mid-wrists.
- Scale: shoulder distance if pose present, else wrist distance.
- Sanitize: fill short gaps by linear interpolation, then forward/backward fill. Replace NaN and Inf with zeros. Clamp to a safe numeric range. Cast to float32.
- Storage: fast binary arrays with sequence masks so attention ignores padding.
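A per-frame sketch of the centering, scaling, and sanitizing rules above (gap interpolation and serialization omitted). The slot positions of the shoulder and wrist joints within the 48-landmark frame are assumptions for illustration.

```python
import numpy as np

# Assumed slot positions within the 48-landmark frame (pose joints first):
L_SH, R_SH, L_WR, R_WR = 0, 1, 4, 5  # pose joints 11, 12, 15, 16

def normalize_frame(frame: np.ndarray) -> np.ndarray:
    """frame: (48, 3) landmarks -> centered, scaled, sanitized float32 copy."""
    f = frame.astype(np.float64)
    if np.isfinite(f[[L_SH, R_SH]]).all():   # pose present: use shoulders
        center = (f[L_SH] + f[R_SH]) / 2
        scale = np.linalg.norm(f[L_SH] - f[R_SH])
    else:                                    # fall back to wrists
        center = (f[L_WR] + f[R_WR]) / 2
        scale = np.linalg.norm(f[L_WR] - f[R_WR])
    f = (f - center) / max(scale, 1e-6)      # guard against zero scale
    f = np.nan_to_num(f, nan=0.0, posinf=0.0, neginf=0.0)  # NaN/Inf -> 0
    return np.clip(f, -1e4, 1e4).astype(np.float32)        # clamp, cast

frame = np.full((48, 3), np.nan, dtype=np.float32)
frame[L_SH] = [0.4, 0.5, 0.0]
frame[R_SH] = [0.6, 0.5, 0.0]
out = normalize_frame(frame)  # shoulders map to (-0.5, 0, 0) and (0.5, 0, 0)
```

Normalizing by shoulder (or wrist) distance makes the features comparable across signers of different sizes and distances from the camera.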
Model architecture
Encoder & decoder
The encoder ingests the landmark sequence and builds contextual representations with multi-head self-attention and position-wise feed-forward layers. The decoder generates characters step-by-step using masked self-attention and cross-attention over encoder outputs.
Practical size
~4–6 layers in both the encoder and decoder, ~8 heads, model width ≈ 256–512. Dropout after attention and feed-forward blocks; residual connections and layer normalization follow standard Transformer design.
| Component | Choice | Notes |
|---|---|---|
| Input | F = 144 per frame | Pose 11–16, both hands 0–20 |
| Encoder | Self-attention | Sinusoidal positional signals |
| Decoder | Masked self-attn + cross-attn | Greedy decode |
| Width | ≈ 256–512 | Dropout, residual, layer norm |
| Output | Letters A–Z + EOS | Softmax over characters |
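The architecture in the table can be sketched with PyTorch's built-in `nn.Transformer`. This is a minimal illustration, not the project's exact code; the vocabulary size (A–Z plus special tokens), learned positional embeddings (the table lists sinusoidal signals), and the 512-frame cap are assumptions.

```python
import torch
import torch.nn as nn

class FingerspellingTransformer(nn.Module):
    """Sketch: landmark frames in, character logits out."""
    def __init__(self, n_feats=144, vocab=28, d_model=256, heads=8, layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_feats, d_model)  # frame -> model width
        self.embed = nn.Embedding(vocab, d_model)   # A-Z + special tokens
        # Learned positions for brevity; sinusoidal signals work the same way.
        self.pos = nn.Parameter(torch.zeros(1, 512, d_model))
        self.core = nn.Transformer(d_model, heads, layers, layers,
                                   dim_feedforward=4 * d_model,
                                   dropout=0.1, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, frames, tokens, frame_pad_mask=None):
        src = self.in_proj(frames) + self.pos[:, : frames.size(1)]
        tgt = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        causal = self.core.generate_square_subsequent_mask(tokens.size(1))
        h = self.core(src, tgt, tgt_mask=causal,
                      src_key_padding_mask=frame_pad_mask)
        return self.out(h)  # (batch, tgt_len, vocab) logits

model = FingerspellingTransformer()
logits = model(torch.randn(2, 100, 144), torch.zeros(2, 7, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 7, 28])
```

The causal mask keeps the decoder from attending to future characters, and `src_key_padding_mask` is where the sequence masks from the storage step plug in so attention ignores padding.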
Training setup
- Loss: sequence cross-entropy with teacher forcing.
- Optimization: Adam with warmup then decay.
- Batching: bucket by length, pad, and mask attention on padded positions.
- Validation: track CER and select checkpoint with lowest validation CER.
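One teacher-forced training step can be sketched as follows. The tiny stand-in model, pad token id, and learning rate are illustrative assumptions (any encoder-decoder mapping frames and tokens to logits fits here); warmup scheduling and length bucketing are omitted.

```python
import torch
import torch.nn as nn

PAD = 0  # assumed padding token id

class TinyModel(nn.Module):
    """Stand-in for the encoder-decoder: (frames, tokens) -> logits."""
    def __init__(self, n_feats=144, vocab=28, d=32):
        super().__init__()
        self.enc = nn.Linear(n_feats, d)
        self.emb = nn.Embedding(vocab, d)
        self.out = nn.Linear(d, vocab)
    def forward(self, frames, tokens):
        ctx = self.enc(frames).mean(dim=1, keepdim=True)  # crude pooled encoder
        return self.out(self.emb(tokens) + ctx)

def train_step(model, opt, frames, tokens):
    """One teacher-forced step; padded targets are ignored in the loss."""
    inp, target = tokens[:, :-1], tokens[:, 1:]   # shift for teacher forcing
    logits = model(frames, inp)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target.reshape(-1),
        ignore_index=PAD)                          # mask padded positions
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = TinyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = train_step(model, opt, torch.randn(4, 50, 144),
                  torch.randint(1, 28, (4, 9)))
```

Teacher forcing feeds the ground-truth prefix to the decoder at each step, so training is parallel across target positions rather than sequential.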
Metric
Character Error Rate (CER) measures character-level edits relative to reference length. It is the primary metric and the basis for checkpoint selection.
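CER is edit (Levenshtein) distance between hypothesis and reference, divided by reference length. A self-contained sketch:

```python
def cer(hyp: str, ref: str) -> float:
    """Character error rate: Levenshtein edits / reference length."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))           # distances from "" to each hyp prefix
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,     # delete
                       d[j - 1] + 1, # insert
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitute
            prev = cur
    return d[n] / max(m, 1)

print(cer("HELLO", "HELLO"))  # 0.0
print(cer("HELO", "HELLO"))   # 0.2 (one edit over 5 reference characters)
```

Note CER can exceed 1.0 when the hypothesis needs more edits than the reference has characters, so it is an error rate rather than one minus accuracy.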
Results
| Metric | Value | Notes |
|---|---|---|
| CER on validation | ≈ 0.36 | From the project repository |
Limits & next steps
- Scope is fingerspelling letters only, not the full sign lexicon.
- Very fast signing can cause dropped or merged letters.
- Lighting, occlusion, or partial hands can degrade landmarks.
Next steps:
- Light personalization with a few user examples.
- Optional beam search for long words.
- Mobile deployment with quantization.
FAQ
Why landmarks instead of pixels?
Landmarks expose the essential signal — hand motion. They reduce input size and speed up training while improving robustness across camera setups.
Why a Transformer for letters?
It models temporal context so the system can disambiguate similar handshapes by using motion before and after each frame.
How is the final checkpoint chosen?
By the lowest validation character error rate on a held-out split.