This study proposes a deepfake video detection framework that leverages multimodal feature fusion and adversarial enhancement to address the limitations of single-modality detectors when facing high-quality forgeries and noise interference. The framework systematically integrates cross-modal consistency analysis with robustness training through a tri-modal architecture: spatio-temporal visual features are extracted by SlowFast-R50, audio context embeddings by VGGish-BiLSTM, and text semantics by Whisper-Transformer. The three streams are dynamically fused via cross-modal self-attention with adaptive weight allocation, and a dual-branch discriminator jointly optimizes a classification loss and a cross-modal consistency loss. FGSM-based adversarial training injects perturbations in both the RGB-frame and audio-spectrogram domains to enhance robustness against Gaussian noise (σ = 0.05) and salt-and-pepper noise (0.02). On FaceForensics++, the framework achieves state-of-the-art video-level accuracies of 98.9% (DeepFake), 98.8% (FaceSwap), 97.6% (Face2Face), and 92.8% (NeuralTextures), exceeding baselines such as ResNet18 by 1.1–5.1 percentage points, while maintaining ≥88.5% accuracy and a 0.893 ROC-AUC under noise. These results suggest that multimodal fusion captures subtle cross-modal contradictions, while adversarial training stabilizes decision boundaries near perturbation thresholds.
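To make the fusion stage concrete, the following is a minimal PyTorch sketch of cross-modal self-attention with adaptive weight allocation feeding a dual-branch head; the class names, embedding dimension, gating design, and loss weighting are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch: self-attention over the three modality embeddings,
    followed by learned adaptive weights (dimensions are assumed)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # one adaptive scalar weight per modality

    def forward(self, visual, audio, text):
        # Each input: (B, dim) embedding from its backbone
        # (SlowFast-R50, VGGish-BiLSTM, Whisper-Transformer in the paper).
        tokens = torch.stack([visual, audio, text], dim=1)    # (B, 3, dim)
        attended, _ = self.attn(tokens, tokens, tokens)       # attention across modalities
        weights = torch.softmax(self.gate(attended), dim=1)   # (B, 3, 1) adaptive weights
        return (weights * attended).sum(dim=1)                 # (B, dim) fused embedding

class DualBranchHead(nn.Module):
    """Sketch of the dual-branch discriminator: one branch classifies
    real vs. fake, the other scores cross-modal consistency."""
    def __init__(self, dim=512):
        super().__init__()
        self.cls = nn.Linear(dim, 2)
        self.consistency = nn.Linear(dim, 1)

    def forward(self, fused):
        return self.cls(fused), self.consistency(fused)

def joint_loss(logits, consistency_score, labels, consistency_target, lam=0.5):
    # Assumed combination: cross-entropy plus a weighted consistency term.
    return F.cross_entropy(logits, labels) + lam * F.mse_loss(
        consistency_score.squeeze(-1), consistency_target)
```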
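The adversarial-enhancement step can be sketched as single-step FGSM applied in both input domains. The perturbation magnitudes, the model's call signature, and the clamping range below are assumptions for illustration; the paper does not specify them here.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, frames, spectrogram, labels, eps_rgb=0.01, eps_audio=0.01):
    """Sketch of FGSM perturbation of RGB frames and audio spectrograms.
    Assumes model(frames, spectrogram) returns class logits; epsilon values
    are illustrative, not taken from the study."""
    frames = frames.clone().detach().requires_grad_(True)
    spectrogram = spectrogram.clone().detach().requires_grad_(True)

    logits = model(frames, spectrogram)
    loss = F.cross_entropy(logits, labels)
    loss.backward()

    # One signed-gradient step per domain; frames assumed normalized to [0, 1].
    adv_frames = (frames + eps_rgb * frames.grad.sign()).clamp(0, 1).detach()
    adv_spec = (spectrogram + eps_audio * spectrogram.grad.sign()).detach()
    return adv_frames, adv_spec
```

During training, the perturbed pairs would be mixed with clean samples so the discriminator sees both, which is what pushes the decision boundary away from the perturbation thresholds described above.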