Adversarially Robust Deepfake Detection
Modern deepfake detection systems have a serious weakness: they can be fooled by tiny, invisible changes made to videos. This project tackles the challenge of building detectors that remain reliable even when attackers deliberately try to trick them. Our approach trains a deepfake detector by constantly attacking it during the learning phase, much as vaccines build immunity through exposure. We use two complementary attacks during training: PGD (which finds maximally damaging pixel-level modifications) and a learned U-Net attacker that generates subtle, spatially coherent perturbations. Our results show the detector maintains much better accuracy on attacked and compressed videos while focusing on natural facial features across the full face rather than brittle local artifacts.
Problem statement
Deepfake media is now highly realistic. Detectors trained only for accuracy on clean benchmarks fail when an attacker adds tiny, crafted perturbations, or when common compression washes away the local artifacts they rely on. This project focuses on resilience: keeping detection performance usable even when inputs are manipulated or passed through lossy compression.
Solution overview
We harden a compact hybrid detector through a game between the model and two complementary attackers. The network faces PGD noise and a learned U-Net attacker that produces realistic, spatially smooth, compression-resilient patterns. The U-Net is constrained by a total variation loss that reduces high-frequency speckle and by a frequency-domain loss that discourages very low-frequency energy, forcing the detector to learn more stable cues.
Architecture
EfficientViT detector
The backbone is EfficientNet-B0, used as a feature extractor that outputs a 7×7×1280 feature map from a 224×224 face crop. We reshape this grid into 49 tokens, prepend a CLS token, add learnable position embeddings, and pass the sequence through a small Transformer encoder with 4 layers and 4 heads. The CLS output feeds a light MLP head for binary classification.
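The reshape-and-encode step can be sketched as follows. This is a minimal illustration, not the exact implementation: the MLP hidden width (128) and the use of PyTorch's stock `nn.TransformerEncoder` are assumptions; only the 49-token grid, CLS token, learnable positions, and the 4-layer/4-head configuration come from the description above.

```python
import torch
import torch.nn as nn

class HybridHead(nn.Module):
    """Turn a 7x7x1280 CNN feature map into 49 tokens + CLS, then a small Transformer."""
    def __init__(self, dim=1280, n_tokens=49, depth=4, heads=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))              # learnable CLS token
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))   # learnable positions
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 128),
                                 nn.GELU(), nn.Linear(128, 1))       # light binary head

    def forward(self, fmap):                        # fmap: (B, 1280, 7, 7) from EfficientNet-B0
        tokens = fmap.flatten(2).transpose(1, 2)    # (B, 49, 1280) patch tokens
        cls = self.cls.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos
        x = self.encoder(x)
        return self.mlp(x[:, 0])                    # classify from the CLS token, (B, 1)
```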
Why hybrid?
CNNs capture local texture tells like edge inconsistencies. Transformers capture long-range relations like lighting agreement between forehead and jaw. Together they cover both local and global cues with a small parameter budget.
Adversarial training
PGD attacker
L∞ PGD with ε = 8/255, step size 2/255, and 10 iterations with a random start; this provides a strong baseline adversary.
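A minimal sketch of this attacker, assuming the detector outputs a single logit per sample and images live in [0, 1]; the loss function choice (`BCEWithLogitsLoss`) is an assumption consistent with binary classification:

```python
import torch

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    """L-inf PGD with random start: ascend the loss, project back into the eps-ball."""
    loss_fn = torch.nn.BCEWithLogitsLoss()
    delta = torch.empty_like(x).uniform_(-eps, eps)   # random start inside the ball
    for _ in range(iters):
        delta.requires_grad_(True)
        loss = loss_fn(model(x + delta).squeeze(1), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta = (delta + step * grad.sign()).clamp(-eps, eps)  # signed ascent + projection
            delta = (x + delta).clamp(0, 1) - x                    # keep the image valid
    return (x + delta).detach()
```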
Learned U-Net attacker
A tiny U-Net takes the clean face crop and outputs a perturbation δ bounded in L∞ by ε through a tanh gate. The U-Net tries to maximize detector loss while obeying realism constraints.
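A one-level version of such an attacker might look like this. The channel width and the single skip connection are illustrative choices; the tanh gate enforcing |δ| ≤ ε matches the description above.

```python
import torch
import torch.nn as nn

class UNetAttacker(nn.Module):
    """Minimal one-level U-Net whose output is an L-inf-bounded perturbation."""
    def __init__(self, eps=8/255, ch=16):
        super().__init__()
        self.eps = eps
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(2 * ch, ch, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, x):
        e = self.enc(x)
        u = self.up(self.down(e))                # encode, downsample, upsample back
        skip = torch.cat([e, u], dim=1)          # skip connection keeps spatial detail
        delta = self.eps * torch.tanh(self.dec(skip))   # tanh gate: |delta| <= eps
        return (x + delta).clamp(0, 1)
```

Training would maximize the detector's loss on `UNetAttacker`'s output while adding the realism penalties described next.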
Realism constraints
- Total variation loss promotes spatial smoothness, avoiding speckle-like artifacts.
- Frequency loss penalizes very low-frequency energy, nudging the attacker toward mid-to-high frequency patterns that survive compression.
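The two constraints could be implemented roughly as follows; the low-frequency window radius is a hypothetical parameter, not a value from the project:

```python
import torch

def tv_loss(delta):
    """Total variation: penalize differences between neighboring pixels (speckle)."""
    dh = (delta[..., 1:, :] - delta[..., :-1, :]).abs().mean()
    dw = (delta[..., :, 1:] - delta[..., :, :-1]).abs().mean()
    return dh + dw

def lowfreq_loss(delta, radius=4):
    """Penalize energy in the lowest spatial frequencies of the perturbation."""
    spec = torch.fft.fftshift(torch.fft.fft2(delta).abs(), dim=(-2, -1))
    H, W = spec.shape[-2:]
    cy, cx = H // 2, W // 2
    # after fftshift, the lowest frequencies sit at the center of the spectrum
    return spec[..., cy - radius:cy + radius, cx - radius:cx + radius].mean()
```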
Data pipeline
Frames are extracted from FaceForensics++ (c23) and Celeb-DF v2. Faces are detected with MTCNN, then aligned and cropped to 224×224. Training uses FF++ frames; evaluation uses Celeb-DF v2 for cross-dataset testing. Preprocessing is a resize plus ImageNet normalization. Training runs on a T4 GPU with AMP.
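The resize-and-normalize step (after face detection and cropping) can be sketched as below; the mean/std values are the standard ImageNet statistics:

```python
import torch
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(crop):
    """crop: (3, H, W) float tensor in [0, 1]; resize to 224x224, ImageNet-normalize."""
    crop = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                         mode="bilinear", align_corners=False)
    return ((crop - IMAGENET_MEAN) / IMAGENET_STD).squeeze(0)
```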
Evaluation protocol
We compare a baseline (trained on clean data) vs. a robust model (dual-attacker loop). Test conditions: clean, JPEG-50, H.264-like, PGD white-box, learned U-Net, and U-Net + compression. Primary metrics: accuracy and ROC AUC.
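For reference, ROC AUC can be computed with the rank-based (Mann–Whitney) formula; this sketch ignores tied scores for brevity and is not the project's evaluation code:

```python
import numpy as np

def roc_auc(labels, scores):
    """Rank-based ROC AUC: probability a positive outscores a negative (no tie handling)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```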
Results
Cross-domain — Celeb-DF v2
| Scenario | Acc (Base) | Acc (Robust) | AUC (Base) | AUC (Robust) |
|---|---|---|---|---|
| Clean | 0.624 | 0.617 | 0.678 | 0.676 |
| JPEG-50 | 0.625 | 0.641 | 0.672 | 0.681 |
| H.264-like | 0.636 | 0.661 | 0.693 | 0.703 |
| PGD white-box | 0.467 | 0.476 | 0.460 | 0.483 |
| Learned U-Net | 0.619 | 0.607 | 0.675 | 0.674 |
| U-Net + JPEG-50 | 0.625 | 0.647 | 0.675 | 0.685 |
| U-Net + H.264-like | 0.644 | 0.654 | 0.690 | 0.699 |
In-domain — FF++ c23
| Scenario | Acc (Base) | Acc (Robust) | AUC (Base) | AUC (Robust) |
|---|---|---|---|---|
| Clean | 0.999 | 0.998 | ~1.000 | ~1.000 |
| JPEG-50 | 0.851 | 0.836 | 0.922 | 0.916 |
| H.264-like | 0.912 | 0.912 | 0.965 | 0.962 |
Explainability
Grad-CAM comparisons show the baseline often fires on sharp borders like jawlines or face boundaries. The robust model spreads attention over cheeks and forehead, with less spill to background — consistent with adversarial training discouraging single-cue dependence and pushing attention toward cues harder to scrub with edits or compression.
In frequency space: baseline attention aligns with very high-frequency details. The robust model raises sensitivity to mid-frequency patterns and smooth shading consistency, which are more durable under compression and against small edits.
Limits & future work
- Strong PGD still hurts both models at this resolution. Larger models or stronger schedules could help.
- Cross-dataset AUC in the high 0.6s shows that generalizing to harder fakes remains challenging.
- Temporal attacks are not yet modeled — extending realism constraints to video time could improve resilience.
FAQ
Why a U-Net attacker rather than only PGD?
PGD is strong but often looks like fine noise. A learned U-Net generates spatially coherent patterns that better mimic cosmetic tweaks, reducing overfitting to one attack style.
Why TV and frequency constraints?
TV lowers speckle for plausibility. Penalizing very low-frequency energy avoids broad washes flattened by compression. Together they push the attacker into the band where compression is less destructive.
Does the robust model hurt clean accuracy?
On clean FF++ and clean Celeb-DF v2, robust accuracy and AUC match the baseline within noise. The advantage shows under attack and compression — where it matters operationally.