Figure 1. Illustration of cross-lingual generalization challenges in current deepfake detection systems. While the model successfully processes familiar scripts (top path, "Great Performance"), it fails to maintain performance when encountering different scripts and language families (bottom path, "Poor Performance"), demonstrating the critical need for improved cross-lingual generalization in audio deepfake detection.
Current audio deepfake detection systems, while effective for widely studied languages like English, degrade significantly when they encounter unfamiliar languages and scripts. This limitation is particularly concerning in linguistically diverse regions like India, where recent incidents involving 50 million AI-generated voice-clone calls span 22 official languages and thousands of dialects. The inability of existing models to generalize across scripts and language families creates critical security vulnerabilities, highlighting the urgent need for robust multilingual detection systems that maintain consistent performance across seen and unseen languages.
The key contributions of our work are as follows:
Figure 2. Overview of the IndicFake benchmark and SAFARI-LLM. The top panel shows the IndicFake dataset spanning 18 Indian languages across three language families, while the bottom panel illustrates our proposed SAFARI-LLM architecture, which leverages dual speech encoders and feature fusion. Three key research questions (RQ1-3) guide our investigation of cross-lingual generalization, cross-language family transfer, and architectural impact on deepfake detection performance.
Distribution of real and synthetic speech samples across 17 Indian languages and English, showing sample counts for different TTS systems with gender-wise splits and total dataset composition.
Distribution of languages in the IndicFake dataset.
IndicFake key statistics.
Distribution of gender in the IndicFake dataset.
Distribution of audio duration in the IndicFake dataset.