Transformer-Based ASL Fingerspelling to Text
Sign language users communicate through hand shapes and movements, but most digital systems cannot interpret these gestures. This system automatically translates American Sign Language fingerspelling, where each hand shape represents a letter, into written English text. It first tracks the 3D positions of the hands and upper-body joints in video using MediaPipe, then feeds these landmark sequences into a Transformer encoder-decoder that recognizes individual letters and detects word boundaries. The model reaches a validation character error rate of ≈ 0.36, a meaningful step toward automated sign language accessibility.
Problem & users
Fingerspelling conveys names and terms without dedicated signs. Recognition is difficult due to fast transitions and co-articulation between letters. The goal is letter-accurate transcription with clear handling of similar handshapes and consistent preprocessing across signers and cameras.
Solution overview
The system replaces raw pixels with landmarks and frames the task as sequence-to-sequence prediction. MediaPipe provides 3D keypoints per frame. A Transformer captures temporal context and outputs letters A–Z plus an end token. The pipeline includes careful feature selection and normalization for stable learning.
Data & feature spec
Dataset
Google American Sign Language Fingerspelling dataset with precomputed MediaPipe landmarks. Examples are landmark sequences paired with target words.
Feature selection (F = 144 per frame)
- Pose joints: six upper body joints (11, 12, 13, 14, 15, 16)
- Left hand: 21 landmarks (0–20)
- Right hand: 21 landmarks (0–20)
- Ordering per frame: pose → left hand → right hand. Each landmark: x, y, z → 48 triplets × 3 = 144 values
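The per-frame feature selection above can be sketched as follows. This is a minimal illustration, not the project's code; the assumed raw layout (33 pose landmarks followed by 21 left-hand and 21 right-hand landmarks, each with x, y, z) matches MediaPipe's standard outputs.

```python
import numpy as np

# Assumed raw MediaPipe layout: 33 pose landmarks, then 21 left-hand
# and 21 right-hand landmarks, each as an (x, y, z) triplet.
POSE_JOINTS = [11, 12, 13, 14, 15, 16]  # shoulders, elbows, wrists
N_POSE, N_HAND = 33, 21

def select_features(frame: np.ndarray) -> np.ndarray:
    """frame: (75, 3) raw landmarks -> (144,) feature vector."""
    pose = frame[POSE_JOINTS]                    # (6, 3)
    left = frame[N_POSE : N_POSE + N_HAND]       # (21, 3)
    right = frame[N_POSE + N_HAND :]             # (21, 3)
    # Ordering per frame: pose -> left hand -> right hand
    return np.concatenate([pose, left, right]).reshape(-1)  # 48 * 3 = 144

frame = np.random.rand(75, 3).astype(np.float32)
features = select_features(frame)
print(features.shape)  # (144,)
```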
Normalization & serialization
- Center: mid-shoulders if pose present, else mid-wrists.
- Scale: shoulder distance if pose present, else wrist distance.
- Sanitize: fill short gaps by linear interpolation, then forward/backward fill. Replace NaN and Inf with zeros. Clamp to a safe numeric range. Cast to float32.
- Storage: fast binary arrays with sequence masks so attention ignores padding.
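A per-frame sketch of the centering, scaling, and sanitizing rules above (gap interpolation and serialization omitted). The slot positions of the shoulder and wrist joints within the 48-landmark frame are assumptions for illustration.

```python
import numpy as np

# Assumed slot positions within the 48-landmark frame (pose joints first):
L_SH, R_SH, L_WR, R_WR = 0, 1, 4, 5  # pose joints 11, 12, 15, 16

def normalize_frame(frame: np.ndarray) -> np.ndarray:
    """frame: (48, 3) landmarks -> centered, scaled, sanitized float32 copy."""
    f = frame.astype(np.float64)
    if np.isfinite(f[[L_SH, R_SH]]).all():   # pose present: use shoulders
        center = (f[L_SH] + f[R_SH]) / 2
        scale = np.linalg.norm(f[L_SH] - f[R_SH])
    else:                                    # fall back to wrists
        center = (f[L_WR] + f[R_WR]) / 2
        scale = np.linalg.norm(f[L_WR] - f[R_WR])
    f = (f - center) / max(scale, 1e-6)      # guard against zero scale
    f = np.nan_to_num(f, nan=0.0, posinf=0.0, neginf=0.0)  # NaN/Inf -> 0
    return np.clip(f, -1e4, 1e4).astype(np.float32)        # clamp, cast

frame = np.full((48, 3), np.nan, dtype=np.float32)
frame[L_SH] = [0.4, 0.5, 0.0]
frame[R_SH] = [0.6, 0.5, 0.0]
out = normalize_frame(frame)  # shoulders map to (-0.5, 0, 0) and (0.5, 0, 0)
```

Normalizing by shoulder (or wrist) distance makes the features comparable across signers of different sizes and distances from the camera.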
Model architecture
Encoder & decoder
The encoder ingests the landmark sequence and builds contextual representations with multi-head self-attention and position-wise feed-forward layers. The decoder generates characters step-by-step using masked self-attention and cross-attention over encoder outputs.
Practical size
~4–6 layers in both the encoder and decoder, ~8 heads, model width ≈ 256–512. Dropout after attention and feed-forward blocks; residual connections and layer normalization follow standard Transformer design.
| Component | Choice | Notes |
|---|---|---|
| Input | F = 144 per frame | Pose 11–16, both hands 0–20 |
| Encoder | Self-attention | Sinusoidal positional signals |
| Decoder | Masked self-attn + cross-attn | Greedy decode |
| Width | ≈ 256–512 | Dropout, residual, layer norm |
| Output | Letters A–Z + EOS | Softmax over characters |
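The architecture in the table can be sketched with PyTorch's built-in `nn.Transformer`. This is a minimal illustration, not the project's exact code; the vocabulary size (A–Z plus special tokens), learned positional embeddings (the table lists sinusoidal signals), and the 512-frame cap are assumptions.

```python
import torch
import torch.nn as nn

class FingerspellingTransformer(nn.Module):
    """Sketch: landmark frames in, character logits out."""
    def __init__(self, n_feats=144, vocab=28, d_model=256, heads=8, layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_feats, d_model)  # frame -> model width
        self.embed = nn.Embedding(vocab, d_model)   # A-Z + special tokens
        # Learned positions for brevity; sinusoidal signals work the same way.
        self.pos = nn.Parameter(torch.zeros(1, 512, d_model))
        self.core = nn.Transformer(d_model, heads, layers, layers,
                                   dim_feedforward=4 * d_model,
                                   dropout=0.1, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, frames, tokens, frame_pad_mask=None):
        src = self.in_proj(frames) + self.pos[:, : frames.size(1)]
        tgt = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        causal = self.core.generate_square_subsequent_mask(tokens.size(1))
        h = self.core(src, tgt, tgt_mask=causal,
                      src_key_padding_mask=frame_pad_mask)
        return self.out(h)  # (batch, tgt_len, vocab) logits

model = FingerspellingTransformer()
logits = model(torch.randn(2, 100, 144), torch.zeros(2, 7, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 7, 28])
```

The causal mask keeps the decoder from attending to future characters, and `src_key_padding_mask` is where the sequence masks from the storage step plug in so attention ignores padding.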
Training setup
- Loss: sequence cross-entropy with teacher forcing.
- Optimization: Adam with warmup then decay.
- Batching: bucket by length, pad, and mask attention on padded positions.
- Validation: track CER and select checkpoint with lowest validation CER.
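One teacher-forced training step can be sketched as follows. The tiny stand-in model, pad token id, and learning rate are illustrative assumptions (any encoder-decoder mapping frames and tokens to logits fits here); warmup scheduling and length bucketing are omitted.

```python
import torch
import torch.nn as nn

PAD = 0  # assumed padding token id

class TinyModel(nn.Module):
    """Stand-in for the encoder-decoder: (frames, tokens) -> logits."""
    def __init__(self, n_feats=144, vocab=28, d=32):
        super().__init__()
        self.enc = nn.Linear(n_feats, d)
        self.emb = nn.Embedding(vocab, d)
        self.out = nn.Linear(d, vocab)
    def forward(self, frames, tokens):
        ctx = self.enc(frames).mean(dim=1, keepdim=True)  # crude pooled encoder
        return self.out(self.emb(tokens) + ctx)

def train_step(model, opt, frames, tokens):
    """One teacher-forced step; padded targets are ignored in the loss."""
    inp, target = tokens[:, :-1], tokens[:, 1:]   # shift for teacher forcing
    logits = model(frames, inp)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target.reshape(-1),
        ignore_index=PAD)                          # mask padded positions
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = TinyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = train_step(model, opt, torch.randn(4, 50, 144),
                  torch.randint(1, 28, (4, 9)))
```

Teacher forcing feeds the ground-truth prefix to the decoder at each step, so training is parallel across target positions rather than sequential.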
Metric
Character Error Rate (CER) measures character-level edits relative to reference length. It is the primary metric and the basis for checkpoint selection.
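CER is edit (Levenshtein) distance between hypothesis and reference, divided by reference length. A self-contained sketch:

```python
def cer(hyp: str, ref: str) -> float:
    """Character error rate: Levenshtein edits / reference length."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))           # distances from "" to each hyp prefix
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,     # delete
                       d[j - 1] + 1, # insert
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitute
            prev = cur
    return d[n] / max(m, 1)

print(cer("HELLO", "HELLO"))  # 0.0
print(cer("HELO", "HELLO"))   # 0.2 (one edit over 5 reference characters)
```

Note CER can exceed 1.0 when the hypothesis needs more edits than the reference has characters, so it is an error rate rather than one minus accuracy.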
Results
| Metric | Value | Notes |
|---|---|---|
| CER on validation | ≈ 0.36 | From the project repository |
Limits & next steps
- Scope is fingerspelling letters only, not the full sign lexicon.
- Very fast signing can cause dropped or merged letters.
- Lighting, occlusion, or partial hands can degrade landmarks.
Next steps:
- Light personalization with a few user examples.
- Optional beam search for long words.
- Mobile deployment with quantization.
FAQ
Why landmarks instead of pixels?
Landmarks expose the essential signal — hand motion. They reduce input size and speed up training while improving robustness across camera setups.
Why a Transformer for letters?
It models temporal context so the system can disambiguate similar handshapes by using motion before and after each frame.
How is the final checkpoint chosen?
By the lowest validation character error rate on a held-out split.