BLUFF: Benchmarking in Low-resoUrce Languages for detecting Falsehoods and Fake news
Code & Documentation contains the source code, usage instructions, detailed descriptions of the data collection, curation, and organization methods, metadata, preprocessing steps, and configuration files. 🤗 Dataset & Splits on HuggingFace hosts the dataset itself, the evaluation splits, and the source data for direct download and use.
Abstract
Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a handful of high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF (Benchmarking in Low-resoUrce Languages for detecting Falsehoods and Fake news), a comprehensive benchmark for detecting false and synthetic content that spans 79 languages and over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) with LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource “big-head” (20) and low-resource “long-tail” (59) languages, addressing critical gaps in multilingual research on false and synthetic content detection. The dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English↔X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities, generated with 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chain-of-Interactions), a novel multi-agent framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline that safeguards dataset integrity. Experiments reveal that state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource relative to high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistically oriented evaluation, comprehensive documentation, and open-source tools to advance equitable falsehood detection.
Key Highlights
- 79 Languages: 12 language families, 10+ scripts, 4 syntactic orders
- 202K+ Samples: 122K human-written + 79K LLM-generated
- 19 LLMs Used: GPT-4o, Claude, Gemini, Llama, Qwen, Aya, DeepSeek & more
- 4 Benchmark Tasks: Veracity + Authorship (binary & multi-class)
- 39 Modification Techniques: 36 manipulation tactics + 3 editing strategies
- 297K+ Seed Articles: from Global News, CNN/DM, MassiveSumm, Visual News
Methodology
The BLUFF pipeline implements an eight-stage process for multilingual generation and detection of false and synthetic content. Beginning with benchmark news corpora (Stage 1), we filter sources by reputation using the Iffy Index (Stage 2), selecting reputable organizations for real news and flagged sources for fake news seeds. From a parametric dictionary (Stage 3), we configure the generation variables: language (79), transformation technique (36 tactics or 3 AI-edits), editing degree (3 levels), and jailbreak strategy (21+). These parameters feed into differentiated AXL-CoI prompts (Stage 4) processed by 19 frontier mLLMs (Stage 5) to generate bidirectionally translated content (English↔70 languages). All outputs undergo mPURIFY quality filtering (Stage 6), which removes hallucinations, mistranslations, and structural defects. We then enrich the dataset with human-written, fact-checked content from IFCN-certified organizations, collected by the BLUFF scraper and machine translated to extend coverage (50→79 languages) (Stage 7). Finally, we evaluate detection capabilities (Stage 8) using fine-tuned encoder-based and in-context-learning decoder-based multilingual transformers.
AXL-CoI: Adversarial Cross-Lingual Agentic Chain-of-Interactions
BLUFF's fake news generation relies on AXL-CoI, an autonomous multi-agent pipeline that orchestrates LLMs through chained interactions to produce realistic multilingual disinformation. The framework operates through two complementary pipelines:
- Fake news pipeline (10 chains): Source ingestion → ADIS safety bypass → adversarial persona assignment → cross-lingual manipulation → social media adaptation → translation → style transfer → quality verification → metadata extraction → output aggregation
- Real news pipeline (8 chains): Source ingestion → content extraction → cross-lingual translation → social media adaptation → translation → style normalization → metadata extraction → output aggregation
A key innovation is ADIS (Autonomous Dynamic Impersonation Self-Attack), which achieves a 100% safety bypass rate across all 19 frontier models. Unlike static jailbreaks, ADIS dynamically generates context-appropriate impersonation strategies (21 unique strategies identified) that frame fake news generation as legitimate professional activities (e.g., journalism education, media literacy research, fact-checking training).
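Conceptually, each chain threads a single record through a sequence of stage functions. The sketch below is a minimal illustration assuming plain Python callables; the stage names follow the fake news pipeline above, while the function bodies and field names are placeholders rather than the released implementation (the real agents issue an LLM call at each step).

```python
from typing import Callable, Dict, List

Record = Dict[str, str]  # one seed article flowing through the chain

# Illustrative stage functions; the real agents issue an LLM call at each step.
def source_ingestion(rec: Record) -> Record:
    return {**rec, "stage": "ingested"}

def adis_safety_bypass(rec: Record) -> Record:
    # ADIS would pick a context-appropriate impersonation strategy here.
    return {**rec, "persona": "media-literacy researcher"}

def cross_lingual_manipulation(rec: Record) -> Record:
    # Tactic, target language, and edit degree come from the Stage 3 parameters.
    return {**rec, "tactic": "false-attribution", "target_lang": "sw"}

FAKE_NEWS_CHAIN: List[Callable[[Record], Record]] = [
    source_ingestion,
    adis_safety_bypass,
    cross_lingual_manipulation,
    # ...social media adaptation, translation, style transfer,
    # quality verification, metadata extraction, output aggregation
]

def run_chain(seed: Record, chain: List[Callable[[Record], Record]]) -> Record:
    """Thread one seed article through the agent chain, stage by stage."""
    record = dict(seed)
    for stage in chain:
        record = stage(record)
    return record
```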
Generation Pipeline. Seven orthogonal dimensions (language, directionality, model, veracity, source, technique, degree) yield 30,240 unique fake news and 144 real news configurations per language.
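As a rough sketch of how these dimensions expand into configurations, the snippet below enumerates the Cartesian product of placeholder dimension values; the actual inventories (and therefore the 30,240 fake / 144 real counts) come from the paper's parametric dictionary, not from these toy lists.

```python
from itertools import product

# Illustrative, heavily truncated dimension values; the full parametric
# dictionary covers 79 languages, 19 models, 36 tactics + 3 AI-edits, etc.
DIMENSIONS = {
    "language": ["sw", "am", "ur"],
    "directionality": ["en->x", "x->en"],
    "model": ["model-a", "model-b"],
    "veracity": ["fake", "real"],
    "source": ["global_news", "cnn_dm"],
    "technique": ["false-attribution", "cherry-picking", "ai-paraphrase"],
    "degree": ["light", "moderate", "heavy"],
}

def configurations(dims):
    """Yield one generation configuration per combination of dimension values."""
    keys = list(dims)
    for values in product(*(dims[k] for k in keys)):
        yield dict(zip(keys, values))

print(sum(1 for _ in configurations(DIMENSIONS)), "example configurations")
```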
mPURIFY: Multilingual Quality Filtering
To ensure dataset integrity, all generated samples pass through mPURIFY, a comprehensive quality filtering pipeline employing 32 features across 5 dimensions:
mPURIFY Overview. Five quality dimensions — consistency, validation, translation, hallucination, and defective content detection — with standard AEM and LLM-AEM evaluation modes.
| Dimension | Features | Purpose |
|---|---|---|
| Consistency | 8 | Cross-field semantic coherence, veracity-content alignment |
| Validation | 6 | Structural completeness, format compliance, field presence |
| Translation | 8 | Language verification, script validation, translation fidelity |
| Hallucination | 5 | Factual grounding, source faithfulness, fabrication detection |
| Defective | 5 | Encoding issues, truncation, repetition, formatting errors |
Filtering results: 181,966 initial samples → 87,211 defect-free → 79,559 retained (43.7% retention rate)
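A minimal sketch of the filtering logic, assuming per-dimension scorers and thresholds (all placeholders, not the released mPURIFY features or calibrated thresholds): a sample is retained only if every quality dimension clears its threshold.

```python
from typing import Callable, Dict

# Placeholder per-dimension scorers in [0, 1]; mPURIFY combines 32 features
# across these five dimensions, with AEM and LLM-AEM evaluation modes.
SCORERS: Dict[str, Callable[[dict], float]] = {
    "consistency":   lambda s: 1.0 if s.get("label") in ("real", "fake") else 0.0,
    "validation":    lambda s: 1.0 if s.get("text") and s.get("language") else 0.0,
    "translation":   lambda s: 1.0,   # e.g. language-ID and script checks
    "hallucination": lambda s: 1.0,   # e.g. source-faithfulness checks
    "defective":     lambda s: 0.0 if "\ufffd" in s.get("text", "") else 1.0,
}

THRESHOLDS = {dim: 0.5 for dim in SCORERS}  # illustrative; calibrated per dimension

def passes_mpurify(sample: dict) -> bool:
    """Retain a sample only if every quality dimension clears its threshold."""
    return all(SCORERS[d](sample) >= THRESHOLDS[d] for d in THRESHOLDS)

samples = [{"text": "example article", "label": "fake", "language": "sw"}]
retained = [s for s in samples if passes_mpurify(s)]
```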
Data Collection
Human-Written Data Curation
Human-written content is sourced from 130 IFCN-certified fact-checking organizations worldwide, ensuring editorial quality and veracity labels grounded in professional journalism standards. A custom web scraping pipeline collects articles in their original languages, preserving linguistic authenticity. Source selection prioritizes organizations verified by the International Fact-Checking Network (IFCN), cross-referencing against the Iffy Index of unreliable sources to exclude any outlets flagged for low journalistic standards.
The collection spans 57 languages (19 big-head, 38 long-tail), providing authentic human-written baselines against which machine-generated text is compared. Articles are processed using Qwen3-8B for initial language identification and GPT-5 for veracity label extraction, disambiguating publisher-specific rating scales (e.g., "Pants on Fire," "Four Pinocchios") into standardized real/fake labels. Qwen3-32B handles structured information extraction. All three models are prompted with explicit instructions to preserve original-language text without translation.
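As an illustration of the label standardization step, the sketch below maps a small, illustrative subset of publisher verdicts onto the binary schema and falls back to a hypothetical LLM call (`llm_extract_label`) for unseen rating scales; it is not the released extraction code.

```python
# Illustrative subset of publisher rating scales mapped to the binary schema.
RATING_TO_LABEL = {
    "pants on fire": "fake",
    "four pinocchios": "fake",
    "false": "fake",
    "true": "real",
}

def standardize_label(raw_rating: str, article_text: str) -> str:
    """Map a publisher-specific verdict to 'real'/'fake', deferring to an LLM otherwise."""
    key = raw_rating.strip().lower()
    if key in RATING_TO_LABEL:
        return RATING_TO_LABEL[key]
    return llm_extract_label(raw_rating, article_text)

def llm_extract_label(raw_rating: str, article_text: str) -> str:
    # Hypothetical stand-in for the model-based veracity extraction step.
    return "fake" if "false" in raw_rating.lower() else "real"

print(standardize_label("Pants on Fire", "..."))  # -> fake
```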
Web Scraping Pipeline. Automated collection from 130 IFCN-certified fact-checkers with source reputation filtering via the Iffy Index.
Multilingual Generation Pipeline
The synthetic data generation pipeline draws from 297,000+ seed articles across four complementary source corpora. Stratified random sampling (seed 42) ensures balanced representation across languages and source types, and each seed article is used exactly once to prevent data leakage.
| Source Dataset | Articles | Type | Languages |
|---|---|---|---|
| Global News Dataset | ~82,000 | Multilingual news | Multiple |
| CNN/Daily Mail | ~82,000 | English news with summaries | English |
| MassiveSumm | ~51,000 | Multilingual summaries | Multiple |
| Visual News | ~82,000 | Multimodal news | English |
| Total | 297,000+ | Seed articles for generation | |
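A minimal sketch of the stratified sampling described above, assuming a pandas DataFrame of seed articles with hypothetical `language` and `source` columns; the per-stratum quota is illustrative.

```python
import pandas as pd

def stratified_sample(seeds: pd.DataFrame, per_stratum: int, seed: int = 42) -> pd.DataFrame:
    """Draw up to `per_stratum` articles per (language, source) stratum, without
    replacement, so each seed article is used at most once."""
    return (
        seeds.groupby(["language", "source"], group_keys=False)
             .apply(lambda g: g.sample(n=min(per_stratum, len(g)), random_state=seed))
             .reset_index(drop=True)
    )

# Toy frame standing in for the 297K+ seed articles.
seeds = pd.DataFrame({
    "language": ["en", "en", "sw", "sw", "am"],
    "source":   ["cnn_dm", "visual_news", "massivesumm", "massivesumm", "global_news"],
    "text":     ["...", "...", "...", "...", "..."],
})
balanced = stratified_sample(seeds, per_stratum=1)
```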
Generation Models
Generation employs 19 models spanning two categories: 13 instruction-tuned LLMs (GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, Gemini 2.0 Flash, Gemini 1.5 Flash, Llama 3.3 70B, Llama 4 Scout, Llama 4 Maverick, Aya Expanse 32B, Mistral Large, Phi-4, Qwen3-8B, Qwen3-32B) and 6 reasoning LRMs (DeepSeek-R1, DeepSeek-R1 Distill Qwen 32B, DeepSeek-R1 Distill Llama 70B, QwQ 32B, o1, Gemini 2.0 Flash Thinking). This diverse model selection ensures the benchmark captures a wide range of generation artifacts, from fluent instruction-following outputs to chain-of-thought reasoning traces.
Bidirectional Translation
Each generation request uses one of 4 prompt variants defined by crossing two veracity orientations (Fake, Real) with two translation directions: English→X (covering 70 languages) and X→English (covering 50 languages with sufficient non-English seeds). This bidirectional design captures both translation-into and translation-from artifacts, which exhibit distinct statistical signatures in machine-translated text (MTT). The complete generation pipeline yields approximately 181,000 samples spanning 1,890 unique tactic combinations of model, prompt variant, language, and source corpus.
Dataset Statistics
After mPURIFY quality filtering, BLUFF contains 202,395 samples across 79 languages. The authorship distribution is:
| Category | Description | Samples | Share |
|---|---|---|---|
| HWT | Human-Written Text | 122,836 | 60.7% |
| HAT | Human-Adapted Text (adversarial rewrites) | 68,148 | 33.7% |
| MGT | Machine-Generated Text (direct LLM output) | 19,234 | 9.5% |
| MTT | Machine-Translated Text (bidirectional) | 156,886 | 77.5% |
Note: Categories overlap—MGT and MTT are subsets of machine-produced content; HAT combines human editing with machine translation. Total unique samples: 202,395.
The benchmark achieves 100% coverage of all 1,890 unique manipulation tactic combinations and 5 of 9 editing configurations (55.6%). Linguistic diversity spans 12 genetic families, 9 script types, and 6 syntactic typologies—providing comprehensive evaluation across the world's linguistic landscape.
Language Coverage Across BLUFF Subsets. Of 79 unique languages, 49 appear in both AI-generated and human-written subsets, 22 are exclusive to AI-generated data, and 8 are exclusive to human-written data.
HWT Language Distribution. 122,836 samples across 57 languages (19 big-head, 38 long-tail).
AI-Generated Language Distribution. 79,943 samples across 71 languages (20 big-head, 51 long-tail).
Hierarchical Language Classification. All 79 BLUFF languages organized by genetic relationship (12 families), script relationship (9 script types), and syntactic relationship (6 typologies).
Benchmark Tasks
BLUFF supports four complementary tasks spanning veracity classification and authorship detection, each evaluated using Macro-F1 to account for class imbalance across languages:
| Task | Description | Classes | Metric |
|---|---|---|---|
| Task 1 | Binary Veracity Classification | Real / Fake | Macro-F1 |
| Task 2 | Multi-class Veracity Classification | Real / Fake × Source Type (8 classes) | Macro-F1 |
| Task 3 | Binary Authorship Detection | Human / Machine | Macro-F1 |
| Task 4 | Multi-class Authorship Attribution | HWT / MGT / MTT / HAT | Macro-F1 |
Content types: HWT = Human-Written Text, MGT = Machine-Generated Text, MTT = Machine-Translated Text, HAT = Human-Adapted Text
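A minimal Task 1 evaluation sketch using scikit-learn's Macro-F1; the HuggingFace repository id, split, and column names below are assumptions standing in for the identifiers on the dataset card.

```python
from datasets import load_dataset
from sklearn.metrics import f1_score

# Hypothetical repo id and column names; consult the dataset card for the real ones.
test = load_dataset("bluff-benchmark/bluff", split="test")

def predict(texts):
    # Stand-in for any detector (fine-tuned encoder, ICL decoder, ...).
    return ["fake"] * len(texts)

y_true = test["veracity_label"]   # assumed column name
y_pred = predict(test["text"])    # assumed column name
print("Task 1 Macro-F1:", f1_score(y_true, y_pred, average="macro"))
```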
Key Results
Cross-lingual transfer heatmaps showing performance variation across language pairs, revealing significant gaps between high-resource and low-resource languages.
Cross-lingual transfer performance grouped by language family, demonstrating the impact of linguistic relatedness on detection accuracy.
Transfer performance by script type, highlighting how shared writing systems facilitate knowledge transfer across languages.
Transfer performance across syntactic typologies (SOV, SVO, VSO, VOS), revealing structural influences on cross-lingual generalization.
Veracity classification performance across language families, showing consistent 9–25% performance gaps between big-head and long-tail languages.
Authorship detection performance across different script types, revealing varying difficulty levels for machine-generated text detection.
mPURIFY quality filtering analysis showing the distribution of defect types and removal rates across the 5 quality dimensions.
LLM-AEM threshold calibration across quality dimensions, balancing precision and recall for optimal sample retention.
Result Tables
Comparison with existing multilingual fake news datasets. BLUFF provides the broadest language coverage (79 languages) and largest scale (202K+ samples).
Binary veracity classification (Task 1) results across 5 encoder models and 6 training settings. S-BERT (LaBSE) achieves the best overall performance.
Binary authorship detection (Task 3) results showing human vs. machine classification accuracy across encoders and training settings.
External evaluation: Models trained on BLUFF tested on independent datasets, demonstrating strong generalization capabilities.
Complete inventory of source corpora used for generation, showing the distribution across datasets, languages, and generation models.
Leaderboard (Multilingual Setting)
Top encoder model performance in the multilingual training setting, where models are trained and evaluated on all 79 languages simultaneously:
| Rank | Model | Task 1: Veracity (F1) | Task 3: Authorship (F1) | Task 4: Attribution (F1) |
|---|---|---|---|---|
| 1 | S-BERT (LaBSE) | 97.2 | 93.2 | 82.0 |
| 2 | mDeBERTa-v3 | 98.3* | 87.3 | 80.6 |
| 3 | XLM-R-large | 84.7 | 87.3 | — |
*Big-head languages only. Full results including per-language breakdown, cross-lingual transfer, and decoder model evaluation are available in the GitHub repository.
Transfer by language family (10 models)
Transfer by script type (10 models)
Generation Models
BLUFF leverages 19 frontier multilingual LLMs spanning both instruction-tuned and reasoning-focused architectures:
Instruction-Tuned LLMs (13)
- GPT-4o, GPT-4o-mini
- Claude 3.5 Sonnet, Claude 3.5 Haiku
- Gemini 2.0 Flash
- Llama 3.1 (8B, 70B), Llama 3.3 70B
- Qwen 2.5 (7B, 72B)
- Aya Expanse (8B, 32B)
- Mistral Small 24B
Reasoning LRMs (6)
- o3-mini
- DeepSeek R1
- DeepSeek R1 Distill (Qwen 32B, Llama 70B)
- QwQ 32B
- Gemini 2.0 Flash Thinking
Citation
Paper currently under review (2026, Datasets and Benchmarks Track). Citation will be provided upon acceptance.