BLUFF: Benchmarking in Low-resoUrce Languages for detecting Falsehoods and Fake news

Jason Lucas1, Matt Murtagh-White2, Adaku Uchendu3, Ali Al-Lawati1, Michiharu Yamashita4, Dominik Macko5, Ivan Srba5, Robert Moro5, Dongwon Lee1
1Penn State University, USA    2Trinity College Dublin, Ireland    3MIT Lincoln Lab, USA    4Visa Research, USA    5KInIT, Slovakia
Under Review 2026 — Datasets and Benchmarks Track

Code & Documentation contains the source code, usage instructions, detailed descriptions of the data collection, curation, and organization methods, metadata, preprocessing steps, and configuration files. 🤗 Dataset & Splits on Hugging Face hosts the dataset itself, the evaluation splits, and the source data for direct download and use.
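
For quick orientation, the snippet below sketches how the dataset could be loaded with the Hugging Face `datasets` library; the repository id and field names are placeholders, so check the linked Hugging Face page for the actual identifiers.

```python
# Minimal loading sketch; "ORG/BLUFF" is a hypothetical repository id,
# not the confirmed path on the Hugging Face Hub.
from datasets import load_dataset

bluff = load_dataset("ORG/BLUFF")
print(bluff)              # available evaluation splits
print(bluff["train"][0])  # one record: text, language, and labels
```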

BLUFF Framework Overview. An 8-stage pipeline spanning data sourcing, adversarial generation via AXL-CoI, quality filtering with mPURIFY, and comprehensive evaluation across 79 languages, 12 language families, and 4 benchmark tasks.

Abstract

Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF (Benchmarking in Low-resoUrce Languages for detecting Falsehoods and Fake news), a comprehensive benchmark for detecting false and synthetic content that spans 79 languages with over 202K samples, combining human-written, fact-checked content (122K+ samples across 57 languages) and LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource “big-head” (20) and low-resource “long-tail” (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English↔X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities, generated using 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chain-of-Interactions), a novel multi-agent framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline that ensures dataset integrity. Experiments reveal that state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource versus high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistically oriented evaluation, comprehensive documentation, and open-source tools to advance equitable falsehood detection.

Key Highlights

  • 79 Languages: 12 language families, 9 script types, 6 syntactic typologies
  • 202K+ Samples: 122K human-written + 79K LLM-generated
  • 19 LLMs Used: GPT-4.1, Gemini, Llama, Qwen, Aya, DeepSeek & more
  • 4 Benchmark Tasks: Veracity + Authorship (binary & multi-class)
  • 39 Modification Techniques: 36 manipulation tactics + 3 editing strategies
  • 297K+ Seed Articles: from Global News, CNN/DM, MassiveSumm, Visual News

Methodology

The BLUFF pipeline implements an eight-stage process for multilingual generation and detection of false and synthetic content. Beginning with benchmark news corpora (Stage 1), we filter sources by reputation using the Iffy Index (Stage 2), selecting reputable organizations for real news and flagged sources for fake news seeds. From a parametric dictionary (Stage 3), we configure the generation variables: language (79), transformation technique (36 tactics or 3 AI-edits), editing degree (3 levels), and jailbreak strategy (21+). These parameters feed into differentiated AXL-CoI prompts (Stage 4) processed by 19 frontier mLLMs (Stage 5) to generate bidirectionally translated content (English↔70 languages). All outputs undergo mPURIFY quality filtering (Stage 6), which removes hallucinations, mistranslations, and structural defects. We enrich the dataset with human-written, fact-checked content from IFCN-certified organizations, which our BLUFF scraper collects and machine-translates from 50 source languages to cover 79 languages (Stage 7). Finally, we evaluate detection capabilities (Stage 8) using fine-tuned encoder-based multilingual transformers and decoder-based models with in-context learning.
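
To make Stage 3 concrete, the sketch below models the parametric dictionary as a plain Python mapping. The key names and the truncated value lists are illustrative assumptions standing in for the full sets (79 languages, 36 tactics + 3 AI-edits, 3 degrees, 21+ jailbreak strategies).

```python
import random

# Hypothetical Stage-3 parametric dictionary; value lists are truncated
# stand-ins for the full sets described above.
PARAMETRIC_DICT = {
    "language": ["en", "sw", "yo", "si", "km"],                      # 79 total
    "technique": ["emotional_framing", "false_attribution", "ai_polish"],
    "degree": ["minor", "moderate", "major"],                        # 3 levels
    "jailbreak": ["journalism_education", "media_literacy_research"],
}

def sample_config(rng: random.Random) -> dict:
    """Draw one generation configuration to instantiate an AXL-CoI prompt."""
    return {key: rng.choice(values) for key, values in PARAMETRIC_DICT.items()}

print(sample_config(random.Random(42)))
```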

AXL-CoI: Adversarial Cross-Lingual Agentic Chain-of-Interactions

BLUFF's fake news generation relies on AXL-CoI, an autonomous multi-agent pipeline that orchestrates LLMs through chained interactions to produce realistic multilingual disinformation. The framework operates through two complementary pipelines (a minimal code sketch follows the list):

  • Fake news pipeline (10 chains): Source ingestion → ADIS safety bypass → adversarial persona assignment → cross-lingual manipulation → social media adaptation → translation → style transfer → quality verification → metadata extraction → output aggregation
  • Real news pipeline (8 chains): Source ingestion → content extraction → cross-lingual translation → social media adaptation → translation → style normalization → metadata extraction → output aggregation
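
The sketch below shows one way such chained interactions could be wired together: each chain is a function over a shared state object, and the output of one chain feeds the next. The dataclass, step names, and state fields are illustrative assumptions; in the real pipeline each step wraps an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ChainState:
    """State threaded through the chain-of-interactions."""
    article: str
    metadata: dict = field(default_factory=dict)

Step = Callable[[ChainState], ChainState]

def run_chain(state: ChainState, steps: list[Step]) -> ChainState:
    for step in steps:
        state = step(state)  # each chain's output feeds the next chain
    return state

# Toy stand-ins for two of the real-news chains.
def content_extraction(s: ChainState) -> ChainState:
    s.metadata["extracted"] = True
    return s

def cross_lingual_translation(s: ChainState) -> ChainState:
    s.metadata["target_lang"] = "sw"  # an mLLM call in the real pipeline
    return s

out = run_chain(ChainState("seed article text"),
                [content_extraction, cross_lingual_translation])
print(out.metadata)
```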

A key innovation is ADIS (Autonomous Dynamic Impersonation Self-Attack), which achieves a 100% safety bypass rate across all 19 frontier models. Unlike static jailbreaks, ADIS dynamically generates context-appropriate impersonation strategies (21 unique strategies identified) that frame fake news generation as legitimate professional activities (e.g., journalism education, media literacy research, fact-checking training).

Generation Pipeline. Seven orthogonal dimensions (language, directionality, model, veracity, source, technique, degree) yield 30,240 unique fake news and 144 real news configurations per language.
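
These counts arise from a Cartesian product over the orthogonal dimensions. The sketch below illustrates the enumeration mechanism with placeholder value sets; the paper's actual sets are what yield 30,240 fake and 144 real configurations per language.

```python
from itertools import product

# Placeholder dimension values; only the enumeration mechanism is real.
DIMENSIONS = {
    "directionality": ["en->x", "x->en"],
    "model": ["model_a", "model_b", "model_c"],
    "source": ["global_news", "cnn_dm", "massivesumm", "visual_news"],
    "technique": ["emotional_framing", "false_attribution"],
    "degree": ["minor", "moderate", "major"],
}

configs = [dict(zip(DIMENSIONS, combo)) for combo in product(*DIMENSIONS.values())]
print(len(configs))  # 2 * 3 * 4 * 2 * 3 = 144 placeholder configurations
```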

mPURIFY: Multilingual Quality Filtering

To ensure dataset integrity, all generated samples pass through mPURIFY, a comprehensive quality filtering pipeline employing 32 features across 5 dimensions:

mPURIFY Overview. Five quality dimensions — consistency, validation, translation, hallucination, and defective content detection — with standard AEM and LLM-AEM evaluation modes.

| Dimension | Features | Purpose |
|---|---|---|
| Consistency | 8 | Cross-field semantic coherence, veracity-content alignment |
| Validation | 6 | Structural completeness, format compliance, field presence |
| Translation | 8 | Language verification, script validation, translation fidelity |
| Hallucination | 5 | Factual grounding, source faithfulness, fabrication detection |
| Defective | 5 | Encoding issues, truncation, repetition, formatting errors |

Filtering results: 181,966 initial samples → 87,211 defect-free → 79,559 retained (43.7% retention rate)
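
The gating logic behind these numbers is conjunctive: a sample is retained only if it passes every dimension. A minimal sketch, assuming each sample is a dict with `text`, `language`, and `label` fields, with toy checks standing in for the 32 real features:

```python
# Toy stand-ins for two of the five mPURIFY dimensions.
def passes_validation(sample: dict) -> bool:
    # Validation: required fields are present and non-empty.
    return all(sample.get(f) for f in ("text", "language", "label"))

def passes_defect_checks(sample: dict) -> bool:
    # Defective-content detection: crude truncation and encoding checks.
    text = sample.get("text", "")
    return len(text) > 50 and "\ufffd" not in text

CHECKS = [passes_validation, passes_defect_checks]
# The full pipeline adds consistency, translation, and hallucination checks.

def purify(samples: list[dict]) -> list[dict]:
    return [s for s in samples if all(check(s) for check in CHECKS)]
```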

Data Collection

Human-Written Data Curation

Human-written content is sourced from 130 IFCN-certified fact-checking organizations worldwide, ensuring editorial quality and veracity labels grounded in professional journalism standards. A custom web scraping pipeline collects articles in their original languages, preserving linguistic authenticity. Source selection prioritizes organizations verified by the International Fact-Checking Network (IFCN), cross-referencing against the Iffy Index of unreliable sources to exclude any outlets flagged for low journalistic standards.
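
The per-publisher parsers live in the released code; the generic sketch below, using `requests` and `BeautifulSoup` with deliberately naive selectors, shows the shape of a single collection step.

```python
import requests
from bs4 import BeautifulSoup

def fetch_article(url: str) -> dict:
    """Fetch one fact-check article; the selectors are illustrative only."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "body": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    }
```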

The collection spans 57 languages (19 big-head, 38 long-tail), providing authentic human-written baselines against which machine-generated text is compared. Articles are processed using Qwen3-8B for initial language identification and GPT-5 for veracity label extraction, disambiguating publisher-specific rating scales (e.g., "Pants on Fire," "Four Pinocchios") into standardized real/fake labels. Qwen3-32B handles structured information extraction. All three models are prompted with explicit instructions to preserve original-language text without translation.
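
The actual disambiguation is prompted to GPT-5, but a lookup-table sketch conveys the idea; the verdict strings below are examples, not the full mapping.

```python
# Hypothetical publisher-verdict mapping; unmapped verdicts fall through
# to None and would be routed to the LLM for disambiguation.
VERDICT_MAP = {
    "pants on fire": "fake",
    "four pinocchios": "fake",
    "false": "fake",
    "true": "real",
    "mostly true": "real",
}

def normalize_verdict(raw: str) -> str | None:
    return VERDICT_MAP.get(raw.strip().lower())

assert normalize_verdict("Pants on Fire") == "fake"
```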

Web Scraping Pipeline. Automated collection from 130 IFCN-certified fact-checkers with source reputation filtering via the Iffy Index.

Multilingual Generation Pipeline

The synthetic data generation pipeline draws from 297,000+ seed articles across four complementary source corpora. Stratified random sampling (seed 42) ensures balanced representation across languages and source types, and each seed article is used exactly once to prevent data leakage.
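
A pandas sketch of this sampling step, assuming a seed-article DataFrame with `language`, `source`, and `article_id` columns (the column names are assumptions about the corpus schema):

```python
import pandas as pd

def sample_seeds(df: pd.DataFrame, per_stratum: int, seed: int = 42) -> pd.DataFrame:
    """Stratified sampling by (language, source) with a fixed seed."""
    parts = [
        group.sample(n=min(per_stratum, len(group)), random_state=seed)
        for _, group in df.groupby(["language", "source"])
    ]
    # Each seed article is used exactly once to prevent data leakage.
    return pd.concat(parts).drop_duplicates(subset="article_id")
```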

| Source Dataset | Articles | Type | Languages |
|---|---|---|---|
| Global News Dataset | ~82,000 | Multilingual news | Multiple |
| CNN/Daily Mail | ~82,000 | English news with summaries | English |
| MassiveSumm | ~51,000 | Multilingual summaries | Multiple |
| Visual News | ~82,000 | Multimodal news | English |
| Total | 297,000+ | Seed articles for generation | |

Generation Models

Generation employs 19 models spanning two categories: 13 instruction-tuned LLMs (GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, Gemini 2.0 Flash, Gemini 1.5 Flash, Llama 3.3 70B, Llama 4 Scout, Llama 4 Maverick, Aya Expanse 32B, Mistral Large, Phi-4, Qwen3-8B, Qwen3-32B) and 6 reasoning LRMs (DeepSeek-R1, DeepSeek-R1 Distill Qwen 32B, DeepSeek-R1 Distill Llama 70B, QwQ 32B, o1, Gemini 2.0 Flash Thinking). This diverse model selection ensures the benchmark captures a wide range of generation artifacts, from fluent instruction-following outputs to chain-of-thought reasoning traces.

Bidirectional Translation

Each generation request uses one of 4 prompt variants defined by crossing two veracity orientations (Fake, Real) with two translation directions: English→X (covering 70 languages) and X→English (covering 50 languages with sufficient non-English seeds). This bidirectional design captures both translation-into and translation-from artifacts, which exhibit distinct statistical signatures in machine-translated text (MTT). The complete generation pipeline yields approximately 181,000 samples spanning 1,890 unique tactic combinations of model, prompt variant, language, and source corpus.
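
The four variants are simply the cross of veracity and direction; a toy enumeration with a placeholder template body:

```python
from itertools import product

# {Fake, Real} x {en->X, X->en}; the template string is a placeholder
# for the differentiated AXL-CoI prompt bodies.
VARIANTS = {
    (veracity, direction): f"[{veracity} | {direction}] <AXL-CoI prompt body>"
    for veracity, direction in product(("fake", "real"), ("en->x", "x->en"))
}
print(len(VARIANTS))  # 4
```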

Dataset Statistics

After mPURIFY quality filtering, BLUFF contains 202,395 samples across 79 languages. The authorship distribution is:

| Category | Description | Samples | Share |
|---|---|---|---|
| HWT | Human-Written Text | 122,836 | 60.7% |
| HAT | Human-Adapted Text (adversarial rewrites) | 68,148 | 33.7% |
| MGT | Machine-Generated Text (direct LLM output) | 19,234 | 9.5% |
| MTT | Machine-Translated Text (bidirectional) | 156,886 | 77.5% |

Note: Categories overlap, so shares sum to more than 100%: MGT and MTT are subsets of machine-produced content, and HAT combines human editing with machine translation. Total unique samples: 202,395.
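
Since the categories overlap, each share is computed against the 202,395 unique samples, as this quick arithmetic check confirms:

```python
TOTAL = 202_395
counts = {"HWT": 122_836, "HAT": 68_148, "MGT": 19_234, "MTT": 156_886}
for name, n in counts.items():
    print(f"{name}: {n / TOTAL:.1%}")
# HWT: 60.7%, HAT: 33.7%, MGT: 9.5%, MTT: 77.5%
```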

The benchmark achieves 100% coverage of all 1,890 unique manipulation tactic combinations and 5 of 9 editing configurations (55.6%). Linguistic diversity spans 12 genetic families, 9 script types, and 6 syntactic typologies—providing comprehensive evaluation across the world's linguistic landscape.

Language Coverage Across BLUFF Subsets. Of 79 unique languages, 49 appear in both AI-generated and human-written subsets, 22 are exclusive to AI-generated data, and 8 are exclusive to human-written data.

HWT Language Distribution. 122,836 samples across 57 languages (19 big-head, 38 long-tail).

AI-Generated Language Distribution. 79,559 samples across 71 languages (20 big-head, 51 long-tail).

Hierarchical Language Classification. All 79 BLUFF languages organized by genetic relationship (12 families), script relationship (9 script types), and syntactic relationship (6 typologies).

Benchmark Tasks

BLUFF supports four complementary tasks spanning veracity classification and authorship detection, each evaluated using Macro-F1 to account for class imbalance across languages:

| Task | Description | Classes | Metric |
|---|---|---|---|
| Task 1 | Binary Veracity Classification | Real / Fake | Macro-F1 |
| Task 2 | Multi-class Veracity Classification | Real / Fake × Source Type (8 classes) | Macro-F1 |
| Task 3 | Binary Authorship Detection | Human / Machine | Macro-F1 |
| Task 4 | Multi-class Authorship Attribution | HWT / MGT / MTT / HAT | Macro-F1 |

Content types: HWT = Human-Written Text, MGT = Machine-Generated Text, MTT = Machine-Translated Text, HAT = Human-Adapted Text
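
Macro-F1 averages per-class F1 scores without frequency weighting, so minority classes in long-tail languages count as much as majority ones. With scikit-learn, the Task 4 metric would look like this (toy labels):

```python
from sklearn.metrics import f1_score

# Toy Task-4 labels; macro averaging weights all four classes equally.
y_true = ["HWT", "MGT", "MTT", "HAT", "HWT", "MTT"]
y_pred = ["HWT", "MGT", "HAT", "HAT", "MGT", "MTT"]
print(f1_score(y_true, y_pred, average="macro"))
```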

Key Results

Leaderboard (Multilingual Setting)

Top encoder model performance in the multilingual training setting, where models are trained and evaluated on all 79 languages simultaneously:

| Rank | Model | Task 1: Veracity (F1) | Task 3: Authorship (F1) | Task 4: Attribution (F1) |
|---|---|---|---|---|
| 1 | S-BERT (LaBSE) | 97.2 | 93.2 | 82.0 |
| 2 | mDeBERTa-v3 | 98.3* | 87.3 | 80.6 |
| 3 | XLM-R-large | 84.7 | 87.3 | – |

*Big-head languages only. Full results including per-language breakdown, cross-lingual transfer, and decoder model evaluation are available in the GitHub repository.

Cross-Lingual Transfer by Language Family (10 models).

Cross-Lingual Transfer by Script Type (10 models).

Citation

Paper currently under review (2026, Datasets and Benchmarks Track). Citation will be provided upon acceptance.