Transformer-Based ASL Fingerspelling to Text

Sign language users communicate through hand shapes and movements, but most digital systems cannot understand these gestures. This system automatically translates American Sign Language fingerspelling — where each hand shape represents a letter — into written English text. It first tracks the 3D positions of hands and upper-body joints in video using MediaPipe, then feeds these landmark sequences into a Transformer encoder-decoder that recognizes individual letters and detects word boundaries. On a held-out validation split, the model reaches a character error rate (CER) of ≈ 0.36 (about 64% character accuracy), a meaningful step toward automated sign language accessibility.

MediaPipe Landmarks · Transformer Encoder-Decoder · CER ≈ 0.36 (validation) · Sequence Normalization · Greedy Decoding

Problem & users

Fingerspelling conveys names and terms without dedicated signs. Recognition is difficult due to fast transitions and co-articulation between letters. The goal is letter-accurate transcription with clear handling of similar handshapes and consistent preprocessing across signers and cameras.

Solution overview

The system replaces raw pixels with landmarks and frames the task as sequence-to-sequence prediction. MediaPipe provides 3D keypoints per frame. A Transformer captures temporal context and outputs letters A–Z plus an end token. The pipeline includes careful feature selection and normalization for stable learning.

Sequence-to-sequence · On-device landmarks · Context-aware decoding

Data & feature spec

Dataset

Google American Sign Language Fingerspelling dataset with precomputed MediaPipe landmarks. Examples are landmark sequences paired with target words.

Feature selection (F = 144 per frame)

  • Pose joints: six upper-body joints, MediaPipe pose indices 11–16 (shoulders, elbows, wrists)
  • Left hand: 21 landmarks (0–20)
  • Right hand: 21 landmarks (0–20)
  • Ordering per frame: pose → left hand → right hand. Each landmark contributes x, y, z: 48 landmarks × 3 = 144 values
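The per-frame feature assembly can be sketched as follows; the function name and array shapes are illustrative, but the landmark selection and ordering match the spec above.

```python
import numpy as np

# Indices follow the spec above: MediaPipe pose joints 11-16 plus
# 21 landmarks per hand, each landmark an (x, y, z) triplet.
POSE_JOINTS = [11, 12, 13, 14, 15, 16]
HAND_LANDMARKS = list(range(21))

def frame_features(pose, left_hand, right_hand):
    """Concatenate selected landmarks into one 144-dim frame vector.

    pose:       (33, 3) array of MediaPipe pose landmarks
    left_hand:  (21, 3) array, NaN-filled when the hand is not detected
    right_hand: (21, 3) array
    """
    parts = [
        pose[POSE_JOINTS],           # (6, 3)  upper-body joints
        left_hand[HAND_LANDMARKS],   # (21, 3)
        right_hand[HAND_LANDMARKS],  # (21, 3)
    ]
    return np.concatenate(parts, axis=0).reshape(-1)  # (48 * 3,) = (144,)
```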

Normalization & serialization

  • Center: mid-shoulders if pose present, else mid-wrists.
  • Scale: shoulder distance if pose present, else wrist distance.
  • Sanitize: fill short gaps by linear interpolation, then forward/backward fill; replace remaining NaN and Inf with zeros; clamp to a safe numeric range; cast to float32.
  • Storage: fast binary arrays with sequence masks so attention ignores padding.
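A minimal sketch of the center/scale and sanitize steps, assuming the (T, 48, 3) layout above; after selection the shoulders sit at positions 0–1 and the wrists at positions 4–5 of the pose block. The clamp range is an assumption.

```python
import numpy as np

SHOULDERS = (0, 1)  # pose joints 11, 12 after selection
WRISTS = (4, 5)     # pose joints 15, 16 after selection

def normalize(seq):
    """Center and scale a (T, 48, 3) landmark sequence, then sanitize."""
    seq = seq.astype(np.float32).copy()
    l, r = seq[:, SHOULDERS[0]], seq[:, SHOULDERS[1]]
    if np.isfinite(l).all() and np.isfinite(r).all():
        # Pose present: center on mid-shoulders, scale by shoulder distance.
        center = (l + r) / 2
        scale = np.linalg.norm(l - r, axis=-1, keepdims=True)
    else:
        # Fall back to mid-wrists / wrist distance.
        lw, rw = seq[:, WRISTS[0]], seq[:, WRISTS[1]]
        center = (lw + rw) / 2
        scale = np.linalg.norm(lw - rw, axis=-1, keepdims=True)
    seq = (seq - center[:, None, :]) / np.maximum(scale[:, None, :], 1e-6)
    # Sanitize: replace NaN/Inf with zeros and clamp to a safe range.
    seq = np.nan_to_num(seq, nan=0.0, posinf=0.0, neginf=0.0)
    return np.clip(seq, -10.0, 10.0)
```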

Model architecture

Encoder & decoder

The encoder ingests the landmark sequence and builds contextual representations with multi-head self-attention and position-wise feed-forward layers. The decoder generates characters step-by-step using masked self-attention and cross-attention over encoder outputs.
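The core operation shared by the encoder's self-attention, the decoder's masked self-attention, and the cross-attention is scaled dot-product attention. A single-head version without learned projections, for illustration only:

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Scaled dot-product attention. q: (Tq, d), k/v: (Tk, d).

    mask: optional (Tq, Tk) boolean array, True = may attend.
    Padding masks and the decoder's causal mask both use this slot.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (Tq, Tk) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9) # block padded / future positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True) # softmax over keys
    return weights @ v                        # (Tq, d) weighted values
```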

Practical size

Roughly 4–6 layers in both the encoder and decoder, ~8 attention heads, model width ≈ 256–512. Dropout after the attention and feed-forward sublayers; residual connections and layer normalization follow standard Transformer design.

| Component | Choice | Notes |
| --- | --- | --- |
| Input | F = 144 per frame | Pose 11–16, both hands 0–20 |
| Encoder | Self-attention | Sinusoidal positional signals |
| Decoder | Masked self-attn + cross-attn | Greedy decode |
| Width | ≈ 256–512 | Dropout, residual, layer norm |
| Output | Letters A–Z + EOS | Softmax over characters |
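Greedy decoding picks the most likely character at each step until the end token appears. A sketch with a stand-in `decoder_step` function in place of the real model; the token ids here (0 = BOS, 1 = EOS) are assumptions:

```python
import numpy as np

BOS, EOS = 0, 1  # assumed special-token ids; 2..27 would be A-Z

def greedy_decode(decoder_step, memory, max_len=32):
    """Decode one word from encoder outputs (`memory`).

    decoder_step(memory, tokens) -> (vocab,) logits for the next token.
    """
    tokens = [BOS]
    for _ in range(max_len):
        logits = decoder_step(memory, tokens)
        nxt = int(np.argmax(logits))  # greedy: take the argmax character
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens[1:]  # drop BOS
```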

Training setup

  • Loss: sequence cross-entropy with teacher forcing.
  • Optimization: Adam with warmup then decay.
  • Batching: bucket by length, pad, and mask attention on padded positions.
  • Validation: track CER and select checkpoint with lowest validation CER.
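The warmup-then-decay schedule could look like the inverse-square-root schedule commonly paired with Adam in Transformer training; the exact schedule and constants used in this project are assumptions.

```python
# Learning rate rises linearly for `warmup` steps, then decays as 1/sqrt(step).
def lr(step, d_model=256, warmup=4000, base=1.0):
    step = max(step, 1)
    return base * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```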

Metric

Character Error Rate (CER) measures character-level edits (insertions, deletions, substitutions) relative to reference length. It is the primary metric and the basis for model selection.
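CER can be computed as the Levenshtein edit distance divided by the reference length:

```python
def cer(hyp, ref):
    """Character error rate: edit distance(hyp, ref) / len(ref)."""
    prev = list(range(len(ref) + 1))  # distances for the empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (h != r)))  # substitution (0 if match)
        prev = cur
    return prev[-1] / max(len(ref), 1)
```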

Results

| Metric | Value | Notes |
| --- | --- | --- |
| CER on validation | ≈ 0.36 | From the project repository |

Common confusions: I vs J, M vs N, U vs V. Double letters may need a brief pause for clear separation.

Limits & next steps

  • Scope is fingerspelling letters only, not the full sign lexicon.
  • Very fast signing can cause dropped or merged letters.
  • Lighting, occlusion, or partial hands can degrade landmarks.

Next steps: light personalization with a few user examples, optional beam search for long words, mobile deployment with quantization.

FAQ

Why landmarks instead of pixels?

Landmarks expose the essential signal — hand motion. They reduce input size and speed up training while improving robustness across camera setups.

Why a Transformer for letters?

It models temporal context so the system can disambiguate similar handshapes by using motion before and after each frame.

How is the final checkpoint chosen?

By the lowest validation character error rate on a held-out split.