1. Problem the product solves
Deepfake media is now highly realistic. Detectors trained only for accuracy on clean benchmarks can fail when an attacker adds tiny, crafted perturbations or when common compressions wash away local artifacts. This product focuses on resilience. The aim is to keep detection performance usable even when inputs are manipulated to evade the model or passed through lossy compression.
2. Solution overview
We harden a compact hybrid detector through a game between the model and two complementary attackers. The network faces PGD noise and a learned U-Net attacker that produces realistic, spatially smooth, compression-resilient patterns. The U-Net is leashed by a total-variation loss that reduces high-frequency speckle and by a frequency-domain loss that discourages very-low-frequency energy. The result is an attacker that searches mid-to-high-frequency space and forces the detector to learn more stable cues.
3. Architecture
3.1 EfficientViT detector
The backbone is EfficientNet-B0, used as a feature extractor that outputs a 7×7×1280 map for a 224×224 face crop. We reshape this grid into 49 tokens, prepend a CLS token, add learnable position embeddings, and pass the sequence through a small Transformer encoder with 4 layers and 4 heads. The CLS output feeds a light MLP head for binary classification.
3.2 Why hybrid
CNNs capture local texture tells such as edge inconsistencies; Transformers capture long-range relations such as lighting agreement between forehead and jaw. Together they cover both local and global cues on a small parameter budget.
4. Adversarial training
4.1 PGD attacker
We use L∞ PGD with ε = 8/255, step size 2/255, and 10 iterations, with a random start. This gives a strong baseline adversary that finds harmful pixel-level changes.
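A minimal sketch of this attacker, assuming inputs in [0, 1] and a single-logit detector trained with BCE (the function name and signature are illustrative):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    """L-infinity PGD with random start; x is assumed to lie in [0, 1]."""
    delta = torch.empty_like(x).uniform_(-eps, eps)   # random start in the eps-ball
    delta = (x + delta).clamp(0, 1) - x               # keep x + delta in [0, 1]
    for _ in range(iters):
        delta.requires_grad_(True)
        loss = F.binary_cross_entropy_with_logits(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta = (delta + step * grad.sign()).clamp(-eps, eps)  # ascent + project
            delta = (x + delta).clamp(0, 1) - x
    return (x + delta).detach()
```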
4.2 Learned U-Net attacker
A tiny U-Net takes the clean face crop and outputs a perturbation δ bounded in L∞ norm by ε through a tanh gate and scaling. The U-Net tries to maximize the detector loss while obeying realism constraints.
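The tanh-gate bound can be sketched as follows. This is a deliberately tiny encoder-decoder with one skip connection standing in for the real U-Net; the channel widths and layer layout are assumptions, but the bounding mechanism (eps × tanh) is the point:

```python
import torch
import torch.nn as nn

class TinyUNetAttacker(nn.Module):
    """Predicts a perturbation delta with ||delta||_inf <= eps by construction."""
    def __init__(self, eps=8/255, ch=16):
        super().__init__()
        self.eps = eps
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                                  nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 2, stride=2),
                                nn.ReLU())
        self.out = nn.Conv2d(ch * 2, 3, 3, padding=1)  # input: skip concat

    def forward(self, x):
        e = self.enc(x)                       # (B, ch, H, W)
        d = self.up(self.down(e))             # back to (B, ch, H, W)
        raw = self.out(torch.cat([e, d], dim=1))
        return self.eps * torch.tanh(raw)     # tanh gate -> L-inf bound of eps
```

Because tanh maps into (-1, 1), scaling by eps guarantees the L∞ constraint without any projection step.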
4.3 Realism constraints
- A total-variation loss promotes spatial smoothness, which avoids speckle-like artifacts.
- A frequency loss penalizes very-low-frequency energy in the perturbation spectrum, nudging the attacker toward mid-to-high-frequency patterns that often survive compression.
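Both constraints are simple to express. A sketch, where the low-frequency mask radius is an assumed hyperparameter:

```python
import torch

def tv_loss(delta):
    """Total variation: mean absolute difference between neighbors.
    Penalizing local jumps pushes delta toward spatially smooth patterns."""
    dh = (delta[..., 1:, :] - delta[..., :-1, :]).abs().mean()
    dw = (delta[..., :, 1:] - delta[..., :, :-1]).abs().mean()
    return dh + dw

def low_freq_loss(delta, radius=8):
    """Penalize spectral energy near the (shifted) spectrum center,
    i.e. the very low frequencies of the perturbation."""
    spec = torch.fft.fftshift(torch.fft.fft2(delta), dim=(-2, -1))
    h, w = delta.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt()
    mask = (dist <= radius).to(spec.real.dtype)   # 1 inside the low-freq disc
    return (spec.abs() * mask).mean()
```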
4.4 Training loop
For each batch we compute the clean loss and the PGD loss, update the U-Net with its composite attacker loss, then update the detector on a fresh U-Net perturbation. The detector minimizes the sum of the clean, PGD, and U-Net losses. Gradient clipping and mixed precision keep the dynamics stable.
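One step of this loop can be sketched as below. AMP is omitted for brevity, and the attacker and realism losses are passed in as callables (`pgd_fn`, `tv_fn`, `freq_fn`) rather than fixed to any particular implementation; the function name and the λ weights are assumptions:

```python
import torch

def train_step(model, unet, x, y, opt_det, opt_unet, criterion,
               pgd_fn, tv_fn, freq_fn, lambda_tv=1.0, lambda_freq=1.0):
    # 1) Attacker update: maximize detector loss, subject to realism penalties.
    delta = unet(x)
    atk_loss = (-criterion(model((x + delta).clamp(0, 1)), y)
                + lambda_tv * tv_fn(delta)
                + lambda_freq * freq_fn(delta))
    opt_unet.zero_grad()
    atk_loss.backward()
    opt_unet.step()

    # 2) Detector update on clean, PGD, and a fresh U-Net perturbation.
    x_pgd = pgd_fn(model, x, y)
    with torch.no_grad():
        x_unet = (x + unet(x)).clamp(0, 1)
    det_loss = (criterion(model(x), y)
                + criterion(model(x_pgd), y)
                + criterion(model(x_unet), y))
    opt_det.zero_grad()
    det_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # stability
    opt_det.step()
    return det_loss.item()
```

Note the ordering: the U-Net is updated first, then the detector sees a perturbation regenerated by the freshly updated attacker.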
5. Data pipeline and setup
We extract frames from FaceForensics++ (c23) and Celeb-DF v2, detect faces with MTCNN, then align and crop to 224×224. Training uses a subset of FF++ frames; evaluation uses clean and attacked crops from Celeb-DF v2 for cross-dataset testing. Preprocessing applies resizing and ImageNet normalization. Training runs on a T4 GPU with AMP.
6. Evaluation protocol
We compare a baseline model trained on clean data against a robust model trained with the dual-attacker loop. We test clean inputs, JPEG quality 50, an H.264-like simulation, white-box PGD, the learned U-Net, and the combined U-Net-plus-compression settings. We report accuracy and ROC AUC as the primary decision metrics.
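The JPEG-50 corruption is easy to reproduce by round-tripping each crop through an in-memory JPEG encode/decode (the H.264-like simulation is not shown here; the function name is illustrative):

```python
import io
from PIL import Image

def jpeg_compress(img, quality=50):
    """Round-trip a PIL image through JPEG at the given quality setting,
    simulating the lossy-compression evaluation condition."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()  # .copy() forces decoding before buf is freed
```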
7. Results at a glance
7.1 Cross-domain performance on Celeb-DF v2
| Scenario | Accuracy (Base) | Accuracy (Robust) | AUC (Base) | AUC (Robust) |
|---|---|---|---|---|
| Clean | 0.624 | 0.617 | 0.678 | 0.676 |
| JPEG 50 | 0.625 | 0.641 | 0.672 | 0.681 |
| H.264-like | 0.636 | 0.661 | 0.693 | 0.703 |
| PGD (white box) | 0.467 | 0.476 | 0.460 | 0.483 |
| Learned U-Net | 0.619 | 0.607 | 0.675 | 0.674 |
| U-Net + JPEG 50 | 0.625 | 0.647 | 0.675 | 0.685 |
| U-Net + H.264-like | 0.644 | 0.654 | 0.690 | 0.699 |
7.2 In-domain performance on FF++ (c23)
Both models are near perfect on clean FF++ frames. Under compression the robust model keeps accuracy competitive while sustaining ROC AUC.
| Scenario | Accuracy (Base) | Accuracy (Robust) | AUC (Base) | AUC (Robust) |
|---|---|---|---|---|
| Clean | 0.999 | 0.998 | ~1.000 | ~1.000 |
| JPEG 50 | 0.851 | 0.836 | 0.922 | 0.916 |
| H.264-like | 0.912 | 0.912 | 0.965 | 0.962 |
8. Explainability insights
Grad-CAM comparisons show that the baseline often fires on sharp borders such as jawlines or the face boundary. The robust model spreads attention over cheeks and forehead, with less spill onto the background. This matches the idea that adversarial training discourages single-cue dependence and pushes attention toward cues that are harder to scrub out with simple edits or compression.
Thinking in frequency space helps. Baseline attention aligns with very-high-frequency details, while the robust model raises sensitivity to mid-frequency patterns and smooth shading consistency, which are more durable under compression and against small edits.
9. Limits and future work
- Strong PGD still hurts both models on faces at this resolution. Larger models or stronger training schedules could help.
- Cross-dataset AUC in the high 0.6s shows that generalization to harder fakes remains challenging.
- Temporal attacks are not modeled yet. Extending the realism constraints along the time dimension of video could improve resilience.
10. FAQ
Why a tiny U-Net attacker rather than only PGD?
PGD is strong but often looks like fine noise. A learned U-Net generates patterns that are spatially coherent and better mimic the small cosmetic tweaks attackers might apply. The combination reduces overfitting to a single attack style.
Why total-variation and frequency constraints?
TV lowers speckle, which makes perturbations more plausible. Penalizing very-low-frequency energy avoids broad washes that compression would flatten anyway. Together they push the attacker into the band where compression is less destructive and the detector must learn more durable cues.
Does the robust model hurt clean accuracy?
On clean FF++ and clean Celeb-DF v2, the robust model's accuracy and AUC match the baseline within noise. The advantage shows up under attack and compression, which is where it matters operationally.