Synopsis

Modern deepfake detection systems have a serious weakness: they can be fooled by tiny, invisible changes made to videos. This project tackles the challenge of building detectors that remain reliable even when attackers deliberately try to trick them. Our approach trains a deepfake detector by constantly attacking it during the learning phase. This is similar to how vaccines work: by exposing the immune system to controlled threats, we build resistance. We use two different types of attacks during training: one that rapidly finds the most damaging modifications possible, and another that learns to create subtle, realistic-looking changes. Our results show that the detector performs just as well as traditional systems on normal videos while maintaining better accuracy on attacked and compressed videos. Additionally, when we analyze what the detector pays attention to, we find that it focuses on natural facial features across the entire face rather than zeroing in on technical glitches that attackers can easily hide. This explains why the system is more robust.

EfficientNet + ViT · PGD + learned U-Net attacker · TV and frequency constraints · FaceForensics++ train / Celeb-DF v2 test · Grad-CAM explainability

1. Problem the product solves

Deepfake media is now highly realistic. Detectors trained only for accuracy on clean benchmarks can fail when an attacker adds tiny, crafted perturbations or when common compression washes away local artifacts. This product focuses on resilience. The aim is to keep detection performance usable even when inputs are manipulated to evade the model or passed through lossy compression.

The north-star metric is maintaining ROC AUC and accuracy when adversarial attacks are combined with compression, so that screening stays dependable.

2. Solution overview

We harden a compact hybrid detector through a game between the model and two complementary attackers. The network faces PGD noise and a learned U-Net attacker that produces realistic, spatially smooth, compression-resilient patterns. The U-Net is leashed by a total variation loss that reduces high-frequency speckle and by a frequency-domain loss that discourages very low-frequency energy. The result is an attacker that searches the mid-to-high-frequency space and forces the detector to learn more stable cues.

Compact model · Dual-attacker training · Compression-aware robustness

3. Architecture

3.1 EfficientViT detector

The backbone is EfficientNet-B0, used as a feature extractor that outputs a 7×7×1280 map for a 224×224 face crop. We reshape this grid into 49 tokens, prepend a CLS token, add learnable position embeddings, and pass the sequence through a small Transformer encoder with 4 layers and 4 heads. The CLS output goes into a light MLP head for binary classification.

EfficientNet-B0 features · ViT encoder, 4 layers · MLP head
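
The sketch below shows one way to wire this up in PyTorch. The shapes (7×7×1280 features, 49 tokens, 4 layers, 4 heads) follow the text, while names such as EfficientViT, embed_dim, and the exact head layout are illustrative assumptions, not the project's actual code.

    import torch
    import torch.nn as nn
    from torchvision.models import efficientnet_b0

    class EfficientViT(nn.Module):
        def __init__(self, embed_dim=1280, depth=4, heads=4):
            super().__init__()
            # EfficientNet-B0 convolutional stages: 224x224 input -> 7x7x1280 map
            self.backbone = efficientnet_b0(weights="IMAGENET1K_V1").features
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, 50, embed_dim))  # 49 patches + CLS
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Sequential(nn.LayerNorm(embed_dim),
                                      nn.Linear(embed_dim, 1))  # real-vs-fake logit

        def forward(self, x):                      # x: (B, 3, 224, 224)
            f = self.backbone(x)                   # (B, 1280, 7, 7)
            tokens = f.flatten(2).transpose(1, 2)  # (B, 49, 1280)
            cls = self.cls_token.expand(x.size(0), -1, -1)
            z = torch.cat([cls, tokens], dim=1) + self.pos_embed
            z = self.encoder(z)
            return self.head(z[:, 0])              # classify from the CLS token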

3.2 Why hybrid

CNNs capture local texture tells like edge inconsistencies. Transformers capture long-range relations like lighting agreement between forehead and jaw. Together they cover both local and global cues with a small parameter budget.

4. Adversarial training

4.1 PGD attacker

We use L-infinity PGD with epsilon 8/255, step size 2/255, and 10 iterations. Random start is included. This gives a strong baseline adversary that finds harmful pixel-level changes.
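
A minimal PGD sketch with the stated budget follows; it assumes a PyTorch model that returns logits, float labels y shaped like those logits, and a loss_fn such as BCEWithLogitsLoss.

    import torch

    def pgd_attack(model, x, y, loss_fn, eps=8/255, step=2/255, iters=10):
        # Random start inside the L-infinity ball, clipped to valid pixels.
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
        for _ in range(iters):
            x_adv.requires_grad_(True)
            loss = loss_fn(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv + step * grad.sign()                     # ascend the loss
                x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project to the ball
                x_adv = x_adv.clamp(0, 1)                              # keep valid pixels
        return x_adv.detach()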

4.2 Learned U-Net attacker

A tiny U-Net takes the clean face crop and outputs a perturbation delta, bounded in L-infinity norm by epsilon through a tanh gate and scaling. The U-Net tries to maximize the detector loss while obeying realism constraints.
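
The gating itself is a one-liner; this sketch assumes a small U-Net module (unet) that maps a crop to an unconstrained residual of the same shape.

    import torch

    def attacker_forward(unet, x, eps=8/255):
        raw = unet(x)                    # unconstrained residual, same shape as x
        delta = eps * torch.tanh(raw)    # tanh gate: every pixel of delta in [-eps, eps]
        return (x + delta).clamp(0, 1), delta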

4.3 Realism constraints

  • Total variation loss promotes spatial smoothness, which avoids speckle-like artifacts.
  • Frequency loss penalizes very low-frequency energy in the perturbation spectrum, which nudges the attacker toward mid-to-high-frequency patterns that often survive compression (both losses are sketched after this list).
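
A minimal sketch of both constraints, assuming delta is a (B, C, H, W) PyTorch tensor; the low-frequency cutoff radius below is an illustrative hyperparameter, not a value from the text.

    import torch

    def tv_loss(delta):
        # Total variation: mean absolute difference between neighboring pixels.
        dh = (delta[..., 1:, :] - delta[..., :-1, :]).abs().mean()
        dw = (delta[..., :, 1:] - delta[..., :, :-1]).abs().mean()
        return dh + dw

    def low_freq_penalty(delta, radius=8):
        # Penalize spectral energy near the center of the shifted spectrum,
        # i.e. the very low frequencies.
        spec = torch.fft.fftshift(torch.fft.fft2(delta), dim=(-2, -1))
        h, w = delta.shape[-2:]
        yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        dist = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt()
        low = (dist < radius).float().to(delta.device)
        return (spec.abs() * low).mean()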

4.4 Training loop

For each batch we compute the clean loss and the PGD loss, update the U-Net with its composite attacker loss, then update the detector on a fresh U-Net perturbation. The detector minimizes the sum of the clean, PGD, and U-Net losses. Gradient clipping and mixed precision keep the dynamics stable.
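
A condensed sketch of one training step, reusing pgd_attack, tv_loss, and low_freq_penalty from the sketches above. The loss weights lambda_tv and lambda_freq are illustrative, labels y are assumed to be floats shaped like the detector's logits, and mixed precision is omitted for brevity.

    import torch
    import torch.nn.functional as F

    def train_step(detector, unet, x, y, opt_det, opt_atk,
                   eps=8/255, lambda_tv=1.0, lambda_freq=1.0):
        # 1) Attacker update: raise the detector's loss under realism constraints.
        delta = eps * torch.tanh(unet(x))
        x_atk = (x + delta).clamp(0, 1)
        atk_loss = (-F.binary_cross_entropy_with_logits(detector(x_atk), y)
                    + lambda_tv * tv_loss(delta)
                    + lambda_freq * low_freq_penalty(delta))
        opt_atk.zero_grad(); atk_loss.backward(); opt_atk.step()

        # 2) Detector update on clean, PGD, and a fresh U-Net perturbation.
        x_pgd = pgd_attack(detector, x, y, F.binary_cross_entropy_with_logits)
        with torch.no_grad():
            x_unet = (x + eps * torch.tanh(unet(x))).clamp(0, 1)
        det_loss = (F.binary_cross_entropy_with_logits(detector(x), y)
                    + F.binary_cross_entropy_with_logits(detector(x_pgd), y)
                    + F.binary_cross_entropy_with_logits(detector(x_unet), y))
        opt_det.zero_grad(); det_loss.backward()
        torch.nn.utils.clip_grad_norm_(detector.parameters(), 1.0)  # stabilize updates
        opt_det.step()
        return det_loss.item()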

This arms race pushes the detector to stop over-relying on a single brittle cue and to combine broader evidence across the face.

5. Data pipeline and setup

We extract frames from FaceForensics++ (c23) and Celeb-DF v2, detect faces with MTCNN, then align and crop to 224×224. Training uses a subset of FF++ frames. Evaluation uses clean and attacked crops from Celeb-DF v2 for cross-dataset testing. Preprocessing uses resizing and ImageNet normalization. Training runs on a T4 GPU with AMP.

FF++ c23 train · Celeb-DF v2 test · MTCNN face crops · AMP enabled
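
A preprocessing sketch using facenet-pytorch's MTCNN; the 224×224 crop size and ImageNet statistics follow the text, while the margin and device choices here are illustrative assumptions.

    import torch
    from PIL import Image
    from facenet_pytorch import MTCNN
    from torchvision import transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"
    mtcnn = MTCNN(image_size=224, margin=20, post_process=False, device=device)
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet stats
                                     std=[0.229, 0.224, 0.225])

    def preprocess(frame_path):
        img = Image.open(frame_path).convert("RGB")
        face = mtcnn(img)               # (3, 224, 224) tensor in [0, 255], or None
        if face is None:
            return None                 # no face detected in this frame
        return normalize(face / 255.0)  # scale to [0, 1], then normalize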

6. Evaluation protocol

We compare a baseline model trained on clean data against a robust model trained with the dual-attacker loop. We test under clean, JPEG quality 50, an H.264-like simulation, PGD white-box, learned U-Net, and combined U-Net-plus-compression settings. We report accuracy and ROC AUC as the primary decision metrics.

Clean · JPEG 50 · H.264-like · PGD · Learned U-Net · U-Net + compression
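
The JPEG setting can be reproduced with an in-memory encode/decode round trip; quality 50 follows the text, and the PIL-based implementation is one reasonable choice rather than the project's confirmed code.

    import io
    from PIL import Image

    def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)  # lossy encode
        buf.seek(0)
        return Image.open(buf).convert("RGB")          # decode back to RGB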

7. Results at a glance

7.1 Cross-domain performance on Celeb-DF v2

Scenario              Accuracy (Base)  Accuracy (Robust)  AUC (Base)  AUC (Robust)
Clean                      0.624             0.617           0.678        0.676
JPEG 50                    0.625             0.641           0.672        0.681
H.264-like                 0.636             0.661           0.693        0.703
PGD white box              0.467             0.476           0.460        0.483
Learned U-Net              0.619             0.607           0.675        0.674
U-Net + JPEG 50            0.625             0.647           0.675        0.685
U-Net + H.264-like         0.644             0.654           0.690        0.699

7.2 In-domain performance on FF++ (c23)

Both models are near-perfect on clean FF++ frames. Under compression, the robust model keeps accuracy competitive while sustaining ROC AUC.

Scenario      Accuracy (Base)  Accuracy (Robust)  AUC (Base)  AUC (Robust)
Clean              0.999             0.998          ~1.000       ~1.000
JPEG 50            0.851             0.836           0.922        0.916
H.264-like         0.912             0.912           0.965        0.962

8. Explainability insights

Grad-CAM comparisons show that the baseline often fires on sharp borders like jawlines or the face boundary. The robust model spreads attention over the cheeks and forehead, with less spill onto the background. This matches the idea that adversarial training discourages single-cue dependence and pushes attention toward cues that are harder to scrub out with simple edits or compression.
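
A minimal Grad-CAM sketch over the detector's last convolutional features; it assumes the EfficientViT-style model sketched earlier, and target_layer (e.g. model.backbone[-1]) is an illustrative choice that would need to match the project's actual module names.

    import torch
    import torch.nn.functional as F

    def grad_cam(model, x, target_layer):
        feats, grads = [], []
        h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
        h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
        logit = model(x)                             # (B, 1) real-vs-fake logit
        model.zero_grad()
        logit.sum().backward()
        h1.remove(); h2.remove()
        w = grads[0].mean(dim=(2, 3), keepdim=True)  # channel-wise importance weights
        cam = F.relu((w * feats[0]).sum(dim=1))      # (B, 7, 7) raw heatmap
        cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        return cam / (cam.max() + 1e-8)              # normalize to [0, 1]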

Framing this in frequency space helps. The baseline's attention aligns with very high-frequency details. The robust model raises its sensitivity to mid-frequency patterns and smooth shading consistency, which are more durable under compression and against small edits.

9. Limits and future work

  • Strong PGD still hurts both models on faces at this resolution. Larger models or stronger training schedules could help.
  • Cross-dataset AUC in the high 0.6s shows that generalization to harder fakes remains challenging.
  • Temporal attacks are not modeled yet. Extending the realism constraints to the time dimension of video could improve resilience.

FAQ

Why a tiny U-Net attacker rather than only PGD

PGD is strong but often looks like fine noise. A learned U-Net generates patterns that are spatially coherent and better mimic the small cosmetic tweaks attackers might apply. The combination reduces overfitting to one attack style.

Why total variation and frequency constraints

TV lowers speckle, which makes perturbations more plausible. Penalizing very low-frequency energy avoids broad washes that compression would flatten anyway. Together they push the attacker into the band where compression is less destructive and the detector must learn more durable cues.

Does the robust model hurt clean accuracy

On clean FF++ and clean Celeb-DF v2, the robust model's accuracy and AUC match the baseline within noise. The advantage shows up under attack and compression, which is where it matters operationally.