Adversarially Robust Deepfake Detection

Modern deepfake detection systems have a serious weakness: they can be fooled by tiny, invisible changes made to videos. This project tackles the challenge of building detectors that remain reliable even when attackers deliberately try to trick them. Our approach involves training a deepfake detector by constantly attacking it during the learning phase — similar to how vaccines work. We use two complementary attacks during training: PGD (which finds maximally damaging pixel-level modifications) and a learned U-Net attacker that generates subtle, spatially-coherent perturbations. Our results show the detector maintains much better accuracy on attacked and compressed videos, while focusing on natural facial features across the full face rather than brittle local artifacts.

EfficientNet + ViT · PGD + Learned U-Net Attacker · TV & Frequency Constraints · FaceForensics++ Train · Celeb-DF v2 Test · Grad-CAM Explainability

Problem statement

Deepfake media is now highly realistic. Detectors trained only for accuracy on clean benchmarks fail when an attacker adds tiny, crafted perturbations or when common compressions wash away local artifacts. This project focuses on resilience — keeping detection performance usable even when inputs are manipulated or passed through lossy compression.

The north-star metric is maintaining ROC AUC and accuracy when attacks meet compression, so that screening stays dependable.

Solution overview

We harden a compact hybrid detector through a game between the model and two complementary attackers. The network faces PGD noise and a learned U-Net attacker that produces realistic, spatially smooth, compression-resilient patterns. The U-Net is leashed by total variation loss to reduce high-frequency speckle and by a frequency-domain loss that discourages very low-frequency energy, forcing the detector to learn more stable cues.

Compact model · Dual-attacker training · Compression-aware robustness

Architecture

EfficientViT detector

The backbone is EfficientNet-B0 used as a feature extractor outputting a 7×7×1280 map for a 224×224 face crop. We reshape this grid into 49 tokens, prepend a CLS token, add learnable positions, and pass the sequence through a small Transformer encoder with 4 layers and 4 heads. The CLS output goes into a light MLP head for binary classification.
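The tokenization step above is just shape bookkeeping; a minimal NumPy sketch (random values standing in for real features and learned parameters) shows how the 7×7×1280 map becomes a 50-token sequence:

```python
import numpy as np

# Hypothetical-shape sketch: EfficientNet-B0 yields a 7x7x1280 map for a
# 224x224 crop; flatten it into 49 tokens, prepend a CLS token, and add
# learnable position embeddings (random stand-ins here).
feat = np.random.randn(1, 7, 7, 1280)        # (batch, H, W, C)
tokens = feat.reshape(1, 49, 1280)           # 49 spatial tokens
cls = np.zeros((1, 1, 1280))                 # CLS token (zero-init sketch)
pos = np.random.randn(1, 50, 1280) * 0.02    # position embeddings
seq = np.concatenate([cls, tokens], axis=1) + pos
print(seq.shape)  # (1, 50, 1280) -- fed to the 4-layer ViT encoder
```

The CLS output of the encoder is then the single vector the MLP head classifies.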

EfficientNet-B0 features → ViT encoder (4 layers) → MLP head

Why hybrid?

CNNs capture local texture tells like edge inconsistencies. Transformers capture long-range relations like lighting agreement between forehead and jaw. Together they cover both local and global cues with a small parameter budget.

Adversarial training

PGD attacker

L∞ PGD with ε = 8/255, step size 2/255, and 10 iterations with a random start provides a strong baseline adversary.
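The PGD loop can be sketched on a toy differentiable loss (a linear score, not the real detector) to show the three moving parts named above: random start, signed gradient ascent, and projection back into the ε-ball:

```python
import numpy as np

# Minimal L-inf PGD sketch on a toy loss. eps, step, and iteration count
# follow the values quoted above; the "model" is a dot product so the
# gradient is trivial.
eps, step, iters = 8 / 255, 2 / 255, 10
rng = np.random.default_rng(0)

x_clean = rng.random((3, 4, 4))            # toy "image" in [0, 1]
w = rng.standard_normal((3, 4, 4))         # toy model weights

def loss_grad(x):
    # loss = sum(w * x); its gradient w.r.t. x is just w
    return w

x = x_clean + rng.uniform(-eps, eps, x_clean.shape)   # random start
x = np.clip(x, 0.0, 1.0)
for _ in range(iters):
    x = x + step * np.sign(loss_grad(x))              # ascend the loss
    x = np.clip(x, x_clean - eps, x_clean + eps)      # L-inf projection
    x = np.clip(x, 0.0, 1.0)                          # valid pixel range

print(np.abs(x - x_clean).max() <= eps + 1e-9)        # True: bound held
```

In the real loop the gradient comes from backpropagating the detector's classification loss to the input pixels.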

Learned U-Net attacker

A tiny U-Net takes the clean face crop and outputs a perturbation δ bounded in L∞ by ε through a tanh gate. The U-Net tries to maximize detector loss while obeying realism constraints.
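The tanh gate is what guarantees the bound regardless of what the network emits; a one-screen sketch, with random values standing in for the (hypothetical) U-Net output:

```python
import numpy as np

# Sketch of the tanh gate bounding the attacker's perturbation: whatever
# raw values the U-Net produces, delta is squashed into the L-inf
# eps-ball. `raw` is a random stand-in for the network output.
eps = 8 / 255
rng = np.random.default_rng(0)
raw = rng.standard_normal((3, 224, 224)) * 5.0   # unbounded output
delta = eps * np.tanh(raw)                       # |delta| <= eps everywhere
print(np.abs(delta).max() <= eps)                # True
```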

Realism constraints

  • Total variation loss promotes spatial smoothness, avoiding speckle-like artifacts.
  • Frequency loss penalizes very low-frequency energy, nudging the attacker toward mid-to-high frequency patterns that survive compression.
The arms race pushes the detector to stop over-relying on a single brittle cue and to combine broader evidence across the face.
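Both constraints are simple penalties added to the attacker's loss; a NumPy sketch of each (the DFT cutoff radius is illustrative, not a value from this project):

```python
import numpy as np

# Sketch of the two realism penalties on a perturbation delta of shape
# (H, W): total variation sums absolute neighbour differences (promotes
# smoothness), and the frequency term measures energy in the lowest DFT
# bins (assumed cutoff radius; illustrative only).
def tv_loss(delta):
    dh = np.abs(np.diff(delta, axis=0)).sum()   # vertical differences
    dw = np.abs(np.diff(delta, axis=1)).sum()   # horizontal differences
    return dh + dw

def low_freq_energy(delta, radius=4):
    spec = np.fft.fftshift(np.fft.fft2(delta))  # DC moved to the centre
    h, w = delta.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return (np.abs(spec[mask]) ** 2).sum()

delta = np.random.randn(32, 32) * 0.01
penalty = tv_loss(delta) + low_freq_energy(delta)   # added to attacker loss
print(penalty >= 0.0)
```

In training these terms are weighted against the attacker's main objective of maximizing detector loss.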

Data pipeline

Frames are extracted from FaceForensics++ (c23) and Celeb-DF v2. Faces are detected with MTCNN, then aligned and cropped to 224×224. Training uses FF++ frames; evaluation uses Celeb-DF v2 for cross-dataset testing. Preprocessing is a resize followed by ImageNet normalization. Training runs on a T4 GPU with AMP (automatic mixed precision).
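The final preprocessing step can be sketched in NumPy; the nearest-neighbour resize here is a stand-in for whatever interpolation the real pipeline uses, and the detection/alignment stages are assumed done:

```python
import numpy as np

# Preprocessing sketch: resize a face crop to 224x224 (nearest-neighbour
# stand-in) and normalize with the standard ImageNet statistics.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(face, size=224):
    h, w, _ = face.shape
    ys = np.arange(size) * h // size          # nearest-neighbour rows
    xs = np.arange(size) * w // size          # nearest-neighbour cols
    resized = face[ys][:, xs]                 # (224, 224, 3) in [0, 1]
    return (resized - IMAGENET_MEAN) / IMAGENET_STD

crop = np.random.rand(180, 150, 3)            # raw face crop in [0, 1]
x = preprocess(crop)
print(x.shape)  # (224, 224, 3)
```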

FF++ c23 train · Celeb-DF v2 test · MTCNN face crops · AMP enabled

Evaluation protocol

We compare a baseline (trained on clean data) vs. a robust model (dual-attacker loop). Test conditions: clean, JPEG-50, H.264-like, PGD white-box, learned U-Net, and U-Net + compression. Primary metrics: accuracy and ROC AUC.
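Both primary metrics are standard; for concreteness, here is a minimal rank-based AUC alongside thresholded accuracy on toy labels and scores (all values illustrative):

```python
import numpy as np

# ROC AUC via the rank statistic: the probability that a randomly chosen
# positive (fake) outscores a randomly chosen negative (real), with ties
# counting half. Labels/scores below are illustrative only.
def roc_auc(labels, scores):
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]        # all positive-negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.6, 0.7]
acc = np.mean((np.array(scores) > 0.5) == np.array(labels))
print(round(roc_auc(labels, scores), 3))      # 0.889
```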

Clean · JPEG-50 · H.264-like · PGD · Learned U-Net · U-Net + compression

Results

Cross-domain — Celeb-DF v2

Scenario               Acc (Base)   Acc (Robust)   AUC (Base)   AUC (Robust)
Clean                  0.624        0.617          0.678        0.676
JPEG-50                0.625        0.641          0.672        0.681
H.264-like             0.636        0.661          0.693        0.703
PGD white-box          0.467        0.476          0.460        0.483
Learned U-Net          0.619        0.607          0.675        0.674
U-Net + JPEG-50        0.625        0.647          0.675        0.685
U-Net + H.264-like     0.644        0.654          0.690        0.699

In-domain — FF++ c23

Scenario               Acc (Base)   Acc (Robust)   AUC (Base)   AUC (Robust)
Clean                  0.999        0.998          ~1.000       ~1.000
JPEG-50                0.851        0.836          0.922        0.916
H.264-like             0.912        0.912          0.965        0.962

Explainability

Grad-CAM comparisons show the baseline often fires on sharp borders like jawlines or face boundaries. The robust model spreads attention over cheeks and forehead, with less spill to background — consistent with adversarial training discouraging single-cue dependence and pushing attention toward cues harder to scrub with edits or compression.

In frequency space: baseline attention aligns with very high-frequency details. The robust model raises sensitivity to mid-frequency patterns and smooth shading consistency, which are more durable under compression and against small edits.

Limits & future work

  • Strong PGD still hurts both models at this resolution. Larger models or stronger schedules could help.
  • Cross-dataset AUC in the high 0.6s shows that generalizing to harder fakes remains challenging.
  • Temporal attacks are not yet modeled — extending realism constraints to video time could improve resilience.

FAQ

Why a U-Net attacker rather than only PGD?

PGD is strong but often looks like fine noise. A learned U-Net generates spatially coherent patterns that better mimic cosmetic tweaks, reducing overfitting to one attack style.

Why TV and frequency constraints?

TV lowers speckle for plausibility. Penalizing very low-frequency energy avoids broad washes flattened by compression. Together they push the attacker into the band where compression is less destructive.

Does the robust model hurt clean accuracy?

On clean FF++ and clean Celeb-DF v2, robust accuracy and AUC match the baseline within noise. The advantage shows under attack and compression — where it matters operationally.