1. Problem and users
Fingerspelling conveys names and terms that have no dedicated signs. Recognition is difficult because of fast transitions and coarticulation between letters. The goal is letter-accurate transcription with clear handling of similar handshapes and consistent preprocessing across signers and cameras.
2. Solution overview
The system replaces raw pixels with landmarks and frames the task as sequence-to-sequence prediction. MediaPipe provides 3D keypoints per frame, and a Transformer captures temporal context and outputs letters A to Z plus an end token. The pipeline includes careful feature selection and normalization for stable learning.
3. Data and feature spec
3.1 Dataset
The project uses the Google American Sign Language Fingerspelling dataset with precomputed MediaPipe landmarks. Each example is a landmark sequence paired with a target word.
3.2 Feature selection (F = 144 per frame)
- Pose joints: six upper-body joints, indices 11 to 16 (shoulders, elbows, and wrists).
- Left hand: 21 landmarks 0 to 20.
- Right hand: 21 landmarks 0 to 20.
- Ordering per frame: pose, then left hand, then right hand. Each landmark contributes x, y, and z, so 48 landmarks × 3 = 144 values.
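A minimal sketch of how one frame vector could be assembled, assuming precomputed MediaPipe arrays `pose` of shape (33, 3) and `left_hand`/`right_hand` of shape (21, 3); the function and variable names are illustrative, not the project's actual code.

```python
import numpy as np

POSE_IDX = [11, 12, 13, 14, 15, 16]  # shoulders, elbows, wrists

def frame_features(pose, left_hand, right_hand):
    """Build one 144-dim frame vector: 6 pose + 21 left-hand + 21 right-hand landmarks, each (x, y, z).

    pose: (33, 3) MediaPipe pose landmarks; left_hand, right_hand: (21, 3) hand landmarks.
    A missing hand can be passed as an all-NaN array and handled later during sanitization.
    """
    parts = [pose[POSE_IDX], left_hand, right_hand]       # (6, 3), (21, 3), (21, 3)
    return np.concatenate(parts, axis=0).reshape(-1).astype(np.float32)  # 48 * 3 = 144
```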
4. Normalization and serialization
- Center: subtract the shoulder midpoint when pose is present, else the midpoint of the two hand wrists.
- Scale: divide by the shoulder-to-shoulder distance when pose is present, else the wrist-to-wrist distance.
- Sanitize: fill short gaps with linear interpolation, then forward and backward fill. Replace remaining NaN and Inf values with zeros, clamp to a safe numeric range, and cast to float32.
- Axes: follow MediaPipe coordinate conventions.
- Storage: fast binary arrays with sequence masks so attention ignores padding.
These steps are mirrored wherever the model is evaluated so distribution shift is minimized.
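A sketch of these steps under the frame layout from Section 3.2 (pose joints in slots 0 to 5, left hand in 6 to 26, right hand in 27 to 47); the slot indices, gap limit, and clamp value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Slot indices within the 48-landmark frame layout of Section 3.2.
L_SHOULDER, R_SHOULDER = 0, 1      # pose joints 11 and 12
L_WRIST, R_WRIST = 6, 27           # hand landmark 0 of the left and right hand

def normalize_sequence(seq, max_gap=5, clamp=5.0):
    """seq: (T, 48, 3) float array with NaN where landmarks are missing."""
    shoulders = seq[:, [L_SHOULDER, R_SHOULDER]]            # (T, 2, 3)
    wrists = seq[:, [L_WRIST, R_WRIST]]                     # (T, 2, 3)
    use_pose = ~np.isnan(shoulders).any(axis=(1, 2))        # (T,) pose present per frame
    ref = np.where(use_pose[:, None, None], shoulders, wrists)

    center = ref.mean(axis=1, keepdims=True)                # (T, 1, 3) midpoint
    scale = np.linalg.norm(ref[:, 0] - ref[:, 1], axis=-1)  # (T,) reference distance
    scale = np.where(scale > 1e-6, scale, 1.0)[:, None, None]
    seq = (seq - center) / scale

    # Sanitize: linear interpolation over short gaps, forward/backward fill,
    # then replace any remaining NaN/Inf and clamp to a safe range.
    frame_table = pd.DataFrame(seq.reshape(len(seq), -1))   # (T, 144)
    frame_table = frame_table.interpolate(limit=max_gap).ffill().bfill()
    out = np.nan_to_num(frame_table.to_numpy(), nan=0.0, posinf=0.0, neginf=0.0)
    return np.clip(out, -clamp, clamp).reshape(seq.shape).astype(np.float32)
```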
5. Model architecture
5.1 Encoder and decoder
The encoder ingests the landmark sequence and builds contextual representations with multi-head self-attention and position-wise feed-forward layers. The decoder generates characters step by step using masked self-attention and cross-attention over the encoder outputs. Positional signals are added on both sides so that order is represented.
5.2 Practical size
A moderate footprint balances latency and accuracy. Typical settings are around four to six layers in both the encoder and decoder, about eight heads, and a model width near 256 to 512. Dropout after attention and after the feed-forward sublayer helps regularize. Residual connections and layer normalization follow the standard Transformer design.
5.3 Output vocabulary
Characters A to Z plus an end token. A linear layer followed by softmax maps decoder states to the character set. Greedy decoding is sufficient for spelling.
| Component | Choice | Notes |
|---|---|---|
| Input | F = 144 features per frame | Pose 11 to 16, both hands 0 to 20 |
| Encoder | Self-attention | Sinusoidal positional signals |
| Decoder | Masked self-attention plus cross-attention | Greedy decode |
| Width | ≈ 256 to 512 | With dropout, residual, layer norm |
| Output | Letters A to Z plus EOS | Softmax over characters |
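A compact PyTorch sketch consistent with the table above; the use of `nn.Transformer`, the token ids, and the exact sizes are illustrative assumptions rather than the project's exact implementation.

```python
import math
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2   # assumed special token ids
VOCAB = 26 + 3            # A to Z plus the three special tokens

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional signals added to the input embeddings."""
    def __init__(self, d_model, max_len=2000):
        super().__init__()
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                     # x: (B, T, d_model)
        return x + self.pe[: x.size(1)]

class FingerspellingModel(nn.Module):
    def __init__(self, feat_dim=144, d_model=384, heads=8, layers=4, dropout=0.1):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)   # project 144 features to model width
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = PositionalEncoding(d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=heads,
            num_encoder_layers=layers, num_decoder_layers=layers,
            dim_feedforward=4 * d_model, dropout=dropout, batch_first=True)
        self.out = nn.Linear(d_model, VOCAB)

    def forward(self, frames, frame_pad_mask, tokens):
        # frames: (B, T, 144); frame_pad_mask: (B, T) True at padded frames; tokens: (B, L)
        src = self.pos(self.in_proj(frames))
        tgt = self.pos(self.embed(tokens))
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(frames.device)
        h = self.transformer(src, tgt, tgt_mask=causal,
                             src_key_padding_mask=frame_pad_mask,
                             memory_key_padding_mask=frame_pad_mask,
                             tgt_key_padding_mask=tokens.eq(PAD))
        return self.out(h)                    # (B, L, VOCAB) logits

@torch.no_grad()
def greedy_decode(model, frames, frame_pad_mask, max_len=32):
    """Greedy character decoding: pick the argmax at each step until EOS."""
    tokens = torch.full((frames.size(0), 1), SOS, dtype=torch.long, device=frames.device)
    for _ in range(max_len):
        logits = model(frames, frame_pad_mask, tokens)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == EOS).all():
            break
    return tokens[:, 1:]                      # drop the start token
```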
6. Training setup
- Loss: sequence cross entropy with teacher forcing.
- Optimization: Adam with warmup then decay.
- Batching: bucket by length, pad, and mask attention on padded positions.
- Validation: track character error rate (CER) and select the checkpoint with the lowest validation CER.
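A sketch of one training step under this setup, reusing the model class and token ids from the Section 5 sketch; the warmup schedule and hyperparameters shown are assumptions, not the project's tuned values.

```python
import torch
import torch.nn as nn

PAD = 0  # same padding id as in the Section 5 sketch
model = FingerspellingModel()                      # from the Section 5 sketch
criterion = nn.CrossEntropyLoss(ignore_index=PAD)  # padded target positions contribute no loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98))

def lr_lambda(step, warmup=1000):
    # Linear warmup followed by inverse-square-root decay.
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def train_step(frames, frame_pad_mask, tokens):
    """frames: (B, T, 144); frame_pad_mask: (B, T) True at padded frames; tokens: (B, L) as SOS ... EOS PAD."""
    # Teacher forcing: feed tokens[:, :-1] to the decoder and predict tokens[:, 1:].
    logits = model(frames, frame_pad_mask, tokens[:, :-1])
    loss = criterion(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```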
7. Metric
Character Error Rate (CER) is the number of character-level edits (insertions, deletions, and substitutions) needed to match the reference, divided by the reference length. It is the primary metric and the basis for model selection.
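A reference implementation of CER via Levenshtein edit distance (the function name is illustrative):

```python
def character_error_rate(prediction: str, reference: str) -> float:
    """Levenshtein edits (insertions, deletions, substitutions) divided by reference length."""
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(prediction, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(prev[j] + 1,              # delete a predicted character
                            curr[j - 1] + 1,          # insert a missing character
                            prev[j - 1] + (p != r)))  # substitute if characters differ
        prev = curr
    return prev[-1] / max(len(reference), 1)

# Example: "HELO" vs "HELLO" needs one insertion, so CER = 1 / 5 = 0.2.
```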
8. Results
| Metric | Value | Notes |
|---|---|---|
| CER on validation | ≈ 0.36 | From the project repository |
9. Limits and next steps
- Scope is fingerspelling letters, not the full sign lexicon.
- Very fast signing can cause dropped or merged letters.
- Lighting, occlusion, or partial hands can degrade landmarks.
Next steps include light personalization with a few user examples, optional beam search for long words, and mobile deployment with quantization.
FAQ
Why landmarks instead of pixels?
Landmarks expose the essential signal, which is hand motion. They reduce input size and speed up training while improving robustness across camera setups.
Why a Transformer for letters?
It models temporal context so the system can disambiguate similar handshapes by using motion before and after each frame.
How is the final checkpoint chosen?
By the lowest validation character error rate on a held out split.