IndicFake Meets SAFARI-LLM:

Unifying Semantic and Acoustic Intelligence for Multilingual Deepfake Detection

Anonymous University

Motivation

Figure 1. Illustration of cross-lingual generalization challenges in current deepfake detection systems. While the model successfully processes familiar scripts (top path, "Great Performance"), it fails to maintain performance when encountering different scripts and language families (bottom path, "Poor Performance"), demonstrating the critical need for improved cross-lingual generalization in audio deepfake detection.

Current audio deepfake detection systems, while effective for widely studied languages like English, degrade significantly when they encounter unfamiliar languages and scripts. This limitation is particularly concerning in linguistically diverse regions like India, where recent incidents involving 50 million AI-generated voice clone calls spanned 22 official languages and thousands of dialects. The inability of existing models to generalize across scripts and language families creates critical security vulnerabilities, highlighting the urgent need for robust multilingual detection systems that maintain consistent performance across seen and unseen languages. These shortcomings leave significant room for improvement.

Contributions

The key contributions of our work are as follows:

  • The IndicFake Dataset: A large-scale multilingual audio deepfake dataset containing 4.2 million samples across 18 Indian languages from three major language families, filling a crucial gap in existing datasets for Indian language coverage.
  • The SAFARI-LLM Architecture: A novel dual-stream approach combining Whisper and m-HuBERT encoders with an Audio Feature Unification Module and large language model integration, designed specifically for multilingual deepfake detection.
  • Comprehensive Cross-Lingual Evaluation: Extensive analysis of cross-lingual and cross-language-family generalization, demonstrating superior performance compared to existing approaches and providing insights into the challenges of multilingual deepfake detection.
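
To make the seen versus unseen comparison concrete, the sketch below splits predictions by whether a sample's language was included in training and reports accuracy for each group. It is a minimal illustration, not the evaluation code used in this work: the record layout, the function name `accuracy_by_group`, the placeholder language and family names, and the choice of accuracy as the metric are all assumptions, and the sample values are invented.

```python
from collections import defaultdict

# Hypothetical records: (language, family, true_label, predicted_label),
# where 1 = fake and 0 = real. Names and values are placeholders made up
# for illustration; they are not data or results from this work.
SAMPLES = [
    ("lang_a", "family_1", 1, 1),
    ("lang_a", "family_1", 0, 0),
    ("lang_b", "family_2", 1, 0),
    ("lang_b", "family_2", 0, 0),
    ("lang_c", "family_1", 1, 1),
    ("lang_c", "family_1", 0, 1),
]


def accuracy_by_group(samples, train_languages):
    """Return accuracy separately for seen and unseen languages.

    A language counts as "seen" if it appears in train_languages.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, _family, label, pred in samples:
        group = "seen" if language in train_languages else "unseen"
        total[group] += 1
        correct[group] += int(label == pred)
    return {group: correct[group] / total[group] for group in total}


# Assume the detector was trained only on lang_a.
result = accuracy_by_group(SAMPLES, train_languages={"lang_a"})
print(result)
```

The same grouping could be keyed on the family field instead of the language to probe cross-language-family transfer, and accuracy could be swapped for another detection metric.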

IndicFake & SAFARI-LLM

Figure 2. Overview of the IndicFake benchmark and SAFARI-LLM. The top panel shows the IndicFake dataset spanning 18 Indian languages across three language families, while the bottom panel illustrates our proposed SAFARI-LLM architecture, which uses dual speech encoders and feature fusion. Three research questions (RQ1-3) guide our investigation of cross-lingual generalization, cross-language-family transfer, and the impact of architecture on deepfake detection performance.

IndicFake Dataset Statistics

Distribution of real and synthetic speech samples across 17 Indian languages and English, showing sample counts for different TTS systems with gender-wise splits and total dataset composition.

Acknowledgement: The website template is taken from Nerfies