BLUFF: Benchmarking in Low-resoUrce Languages for detecting Falsehoods and Fake news

Jason Lucas1, Matt Murtagh-White2, Adaku Uchendu3, Ali Al-Lawati1, Michiharu Yamashita4, Dominik Macko5, Ivan Srba5, Robert Moro5, Dongwon Lee1
1Penn State University, USA    2Trinity College Dublin, Ireland    3MIT Lincoln Lab, USA    4Visa Research, USA    5KInIT, Slovakia
Under Review 2026 — Datasets and Benchmarks Track

Code & Documentation contains the source code, usage instructions, detailed descriptions of the data collection, curation, and organization methods, metadata, preprocessing steps, and configuration files. 🤗 Dataset & Splits on Hugging Face hosts the dataset itself, the evaluation splits, and the source data for direct download and use.
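
For quick orientation, the snippet below sketches how the dataset could be loaded with the Hugging Face `datasets` library; the repository id and field names are placeholders, so check the linked Hugging Face page for the actual identifiers.

```python
# Minimal loading sketch; "ORG/BLUFF" is a hypothetical repository id,
# not the confirmed path on the Hugging Face Hub.
from datasets import load_dataset

bluff = load_dataset("ORG/BLUFF")
print(bluff)              # available evaluation splits
print(bluff["train"][0])  # one record: text, language, and labels
```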

BLUFF Framework Overview. An 8-stage pipeline spanning data sourcing, adversarial generation via AXL-CoI, quality filtering with mPURIFY, and comprehensive evaluation across 79 languages, 12 language families, and 4 benchmark tasks.

Abstract

Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF (Benchmarking in Low-resoUrce Languages for detecting Falsehoods and Fake news), a comprehensive benchmark for detecting false and synthetic content that spans 79 languages with over 202K samples, combining human-written, fact-checked content (122K+ samples across 57 languages) and LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource “big-head” (20) and low-resource “long-tail” (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English↔X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities, generated using 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chain-of-Interactions), a novel multi-agent framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline that ensures dataset integrity. Experiments reveal that state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource versus high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistically oriented evaluation, comprehensive documentation, and open-source tools to advance equitable falsehood detection.

Key Highlights

  • 79 Languages: 12 language families, 9 script types, 6 syntactic typologies
  • 202K+ Samples: 122K human-written + 79K LLM-generated
  • 19 LLMs Used: GPT-4.1, Gemini, Llama, Qwen, Aya, DeepSeek & more
  • 4 Benchmark Tasks: Veracity + Authorship (binary & multi-class)
  • 39 Modification Techniques: 36 manipulation tactics + 3 editing strategies
  • 297K+ Seed Articles: from Global News, CNN/DM, MassiveSumm, Visual News

Methodology

The BLUFF pipeline implements an eight-stage process for multilingual generation and detection of false and synthetic content. Beginning with benchmark news corpora (Stage 1), we filter sources by reputation using the Iffy Index (Stage 2), selecting reputable organizations for real news and flagged sources for fake news seeds. From a parametric dictionary (Stage 3), we configure the generation variables: language (79), transformation technique (36 tactics or 3 AI-edits), editing degree (3 levels), and jailbreak strategy (21+). These parameters feed into differentiated AXL-CoI prompts (Stage 4) processed by 19 frontier mLLMs (Stage 5) to generate bidirectionally translated content (English↔70 languages). All outputs undergo mPURIFY quality filtering (Stage 6), which removes hallucinations, mistranslations, and structural defects. We enrich the dataset with human-written, fact-checked content from IFCN-certified organizations, which our BLUFF scraper collects and machine-translates from 50 source languages to cover 79 languages (Stage 7). Finally, we evaluate detection capabilities (Stage 8) using fine-tuned encoder-based multilingual transformers and decoder-based models with in-context learning.
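
To make Stage 3 concrete, the sketch below models the parametric dictionary as a plain Python mapping. The key names and the truncated value lists are illustrative assumptions standing in for the full sets (79 languages, 36 tactics + 3 AI-edits, 3 degrees, 21+ jailbreak strategies).

```python
import random

# Hypothetical Stage-3 parametric dictionary; value lists are truncated
# stand-ins for the full sets described above.
PARAMETRIC_DICT = {
    "language": ["en", "sw", "yo", "si", "km"],                      # 79 total
    "technique": ["emotional_framing", "false_attribution", "ai_polish"],
    "degree": ["minor", "moderate", "major"],                        # 3 levels
    "jailbreak": ["journalism_education", "media_literacy_research"],
}

def sample_config(rng: random.Random) -> dict:
    """Draw one generation configuration to instantiate an AXL-CoI prompt."""
    return {key: rng.choice(values) for key, values in PARAMETRIC_DICT.items()}

print(sample_config(random.Random(42)))
```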

AXL-CoI: Adversarial Cross-Lingual Agentic Chain-of-Interactions

BLUFF's fake news generation relies on AXL-CoI, an autonomous multi-agent pipeline that orchestrates LLMs through chained interactions to produce realistic multilingual disinformation. The framework operates through two complementary pipelines (a minimal code sketch follows the list):

  • Fake news pipeline (10 chains): Source ingestion → ADIS safety bypass → adversarial persona assignment → cross-lingual manipulation → social media adaptation → translation → style transfer → quality verification → metadata extraction → output aggregation
  • Real news pipeline (8 chains): Source ingestion → content extraction → cross-lingual translation → social media adaptation → translation → style normalization → metadata extraction → output aggregation
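
The sketch below shows one way such chained interactions could be wired together: each chain is a function over a shared state object, and the output of one chain feeds the next. The dataclass, step names, and state fields are illustrative assumptions; in the real pipeline each step wraps an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ChainState:
    """State threaded through the chain-of-interactions."""
    article: str
    metadata: dict = field(default_factory=dict)

Step = Callable[[ChainState], ChainState]

def run_chain(state: ChainState, steps: list[Step]) -> ChainState:
    for step in steps:
        state = step(state)  # each chain's output feeds the next chain
    return state

# Toy stand-ins for two of the real-news chains.
def content_extraction(s: ChainState) -> ChainState:
    s.metadata["extracted"] = True
    return s

def cross_lingual_translation(s: ChainState) -> ChainState:
    s.metadata["target_lang"] = "sw"  # an mLLM call in the real pipeline
    return s

out = run_chain(ChainState("seed article text"),
                [content_extraction, cross_lingual_translation])
print(out.metadata)
```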

A key innovation is ADIS (Autonomous Dynamic Impersonation Self-Attack), which achieves a 100% safety bypass rate across all 19 frontier models. Unlike static jailbreaks, ADIS dynamically generates context-appropriate impersonation strategies (21 unique strategies identified) that frame fake news generation as legitimate professional activities (e.g., journalism education, media literacy research, fact-checking training).

Generation Pipeline. Seven orthogonal dimensions (language, directionality, model, veracity, source, technique, degree) yield 30,240 unique fake news and 144 real news configurations per language.
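
These counts arise from a Cartesian product over the orthogonal dimensions. The sketch below illustrates the enumeration mechanism with placeholder value sets; the paper's actual sets are what yield 30,240 fake and 144 real configurations per language.

```python
from itertools import product

# Placeholder dimension values; only the enumeration mechanism is real.
DIMENSIONS = {
    "directionality": ["en->x", "x->en"],
    "model": ["model_a", "model_b", "model_c"],
    "source": ["global_news", "cnn_dm", "massivesumm", "visual_news"],
    "technique": ["emotional_framing", "false_attribution"],
    "degree": ["minor", "moderate", "major"],
}

configs = [dict(zip(DIMENSIONS, combo)) for combo in product(*DIMENSIONS.values())]
print(len(configs))  # 2 * 3 * 4 * 2 * 3 = 144 placeholder configurations
```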

mPURIFY: Multilingual Quality Filtering

To ensure dataset integrity, all generated samples pass through mPURIFY, a comprehensive quality filtering pipeline employing 32 features across 5 dimensions:

mPURIFY Overview. Five quality dimensions — consistency, validation, translation, hallucination, and defective content detection — with standard AEM and LLM-AEM evaluation modes.

| Dimension | Features | Purpose |
|---|---|---|
| Consistency | 8 | Cross-field semantic coherence, veracity-content alignment |
| Validation | 6 | Structural completeness, format compliance, field presence |
| Translation | 8 | Language verification, script validation, translation fidelity |
| Hallucination | 5 | Factual grounding, source faithfulness, fabrication detection |
| Defective | 5 | Encoding issues, truncation, repetition, formatting errors |

Filtering results: 181,966 initial samples → 87,211 defect-free → 79,559 retained (43.7% retention rate)
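
The gating logic behind these numbers is conjunctive: a sample is retained only if it passes every dimension. A minimal sketch, assuming each sample is a dict with `text`, `language`, and `label` fields, with toy checks standing in for the 32 real features:

```python
# Toy stand-ins for two of the five mPURIFY dimensions.
def passes_validation(sample: dict) -> bool:
    # Validation: required fields are present and non-empty.
    return all(sample.get(f) for f in ("text", "language", "label"))

def passes_defect_checks(sample: dict) -> bool:
    # Defective-content detection: crude truncation and encoding checks.
    text = sample.get("text", "")
    return len(text) > 50 and "\ufffd" not in text

CHECKS = [passes_validation, passes_defect_checks]
# The full pipeline adds consistency, translation, and hallucination checks.

def purify(samples: list[dict]) -> list[dict]:
    return [s for s in samples if all(check(s) for check in CHECKS)]
```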

Data Collection

Human-Written Data Curation

Human-written content is sourced from 130 IFCN-certified fact-checking organizations worldwide, ensuring editorial quality and veracity labels grounded in professional journalism standards. A custom web scraping pipeline collects articles in their original languages, preserving linguistic authenticity. Source selection prioritizes organizations verified by the International Fact-Checking Network (IFCN), cross-referencing against the Iffy Index of unreliable sources to exclude any outlets flagged for low journalistic standards.
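
The per-publisher parsers live in the released code; the generic sketch below, using `requests` and `BeautifulSoup` with deliberately naive selectors, shows the shape of a single collection step.

```python
import requests
from bs4 import BeautifulSoup

def fetch_article(url: str) -> dict:
    """Fetch one fact-check article; the selectors are illustrative only."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "body": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    }
```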

The collection spans 57 languages (19 big-head, 38 long-tail), providing authentic human-written baselines against which machine-generated text is compared. Articles are processed using Qwen3-8B for initial language identification and GPT-5 for veracity label extraction, disambiguating publisher-specific rating scales (e.g., "Pants on Fire," "Four Pinocchios") into standardized real/fake labels. Qwen3-32B handles structured information extraction. All three models are prompted with explicit instructions to preserve original-language text without translation.
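
The actual disambiguation is prompted to GPT-5, but a lookup-table sketch conveys the idea; the verdict strings below are examples, not the full mapping.

```python
# Hypothetical publisher-verdict mapping; unmapped verdicts fall through
# to None and would be routed to the LLM for disambiguation.
VERDICT_MAP = {
    "pants on fire": "fake",
    "four pinocchios": "fake",
    "false": "fake",
    "true": "real",
    "mostly true": "real",
}

def normalize_verdict(raw: str) -> str | None:
    return VERDICT_MAP.get(raw.strip().lower())

assert normalize_verdict("Pants on Fire") == "fake"
```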

Web Scraping Pipeline. Automated collection from 130 IFCN-certified fact-checkers with source reputation filtering via the Iffy Index.

Multilingual Generation Pipeline

The synthetic data generation pipeline draws from 297,000+ seed articles across four complementary source corpora. Stratified random sampling (seed 42) ensures balanced representation across languages and source types, and each seed article is used exactly once to prevent data leakage.
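
A pandas sketch of this sampling step, assuming a seed-article DataFrame with `language`, `source`, and `article_id` columns (the column names are assumptions about the corpus schema):

```python
import pandas as pd

def sample_seeds(df: pd.DataFrame, per_stratum: int, seed: int = 42) -> pd.DataFrame:
    """Stratified sampling by (language, source) with a fixed seed."""
    parts = [
        group.sample(n=min(per_stratum, len(group)), random_state=seed)
        for _, group in df.groupby(["language", "source"])
    ]
    # Each seed article is used exactly once to prevent data leakage.
    return pd.concat(parts).drop_duplicates(subset="article_id")
```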

| Source Dataset | Articles | Type | Languages |
|---|---|---|---|
| Global News Dataset | ~82,000 | Multilingual news | Multiple |
| CNN/Daily Mail | ~82,000 | English news with summaries | English |
| MassiveSumm | ~51,000 | Multilingual summaries | Multiple |
| Visual News | ~82,000 | Multimodal news | English |
| Total | 297,000+ | Seed articles for generation | |

Generation Models

Generation employs 19 models spanning two categories: 13 instruction-tuned LLMs (GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, Gemini 2.0 Flash, Gemini 1.5 Flash, Llama 3.3 70B, Llama 4 Scout, Llama 4 Maverick, Aya Expanse 32B, Mistral Large, Phi-4, Qwen3-8B, Qwen3-32B) and 6 reasoning LRMs (DeepSeek-R1, DeepSeek-R1 Distill Qwen 32B, DeepSeek-R1 Distill Llama 70B, QwQ 32B, o1, Gemini 2.0 Flash Thinking). This diverse model selection ensures the benchmark captures a wide range of generation artifacts, from fluent instruction-following outputs to chain-of-thought reasoning traces.

Bidirectional Translation

Each generation request uses one of 4 prompt variants defined by crossing two veracity orientations (Fake, Real) with two translation directions: English→X (covering 70 languages) and X→English (covering 50 languages with sufficient non-English seeds). This bidirectional design captures both translation-into and translation-from artifacts, which exhibit distinct statistical signatures in machine-translated text (MTT). The complete generation pipeline yields approximately 181,000 samples spanning 1,890 unique tactic combinations of model, prompt variant, language, and source corpus.
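
The four variants are simply the cross of veracity and direction; a toy enumeration with a placeholder template body:

```python
from itertools import product

# {Fake, Real} x {en->X, X->en}; the template string is a placeholder
# for the differentiated AXL-CoI prompt bodies.
VARIANTS = {
    (veracity, direction): f"[{veracity} | {direction}] <AXL-CoI prompt body>"
    for veracity, direction in product(("fake", "real"), ("en->x", "x->en"))
}
print(len(VARIANTS))  # 4
```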

Dataset Statistics

After mPURIFY quality filtering, BLUFF contains 202,395 samples across 79 languages. The authorship distribution is:

| Category | Description | Samples | Share |
|---|---|---|---|
| HWT | Human-Written Text | 122,836 | 60.7% |
| HAT | Human-Adapted Text (adversarial rewrites) | 68,148 | 33.7% |
| MGT | Machine-Generated Text (direct LLM output) | 19,234 | 9.5% |
| MTT | Machine-Translated Text (bidirectional) | 156,886 | 77.5% |

Note: Categories overlap, so shares sum to more than 100%: MGT and MTT are subsets of machine-produced content, and HAT combines human editing with machine translation. Total unique samples: 202,395.
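
Since the categories overlap, each share is computed against the 202,395 unique samples, as this quick arithmetic check confirms:

```python
TOTAL = 202_395
counts = {"HWT": 122_836, "HAT": 68_148, "MGT": 19_234, "MTT": 156_886}
for name, n in counts.items():
    print(f"{name}: {n / TOTAL:.1%}")
# HWT: 60.7%, HAT: 33.7%, MGT: 9.5%, MTT: 77.5%
```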

The benchmark achieves 100% coverage of all 1,890 unique manipulation tactic combinations and 5 of 9 editing configurations (55.6%). Linguistic diversity spans 12 genetic families, 9 script types, and 6 syntactic typologies—providing comprehensive evaluation across the world's linguistic landscape.

Language Coverage Across BLUFF Subsets. Of 79 unique languages, 49 appear in both AI-generated and human-written subsets, 22 are exclusive to AI-generated data, and 8 are exclusive to human-written data.

HWT Language Distribution. 122,836 samples across 57 languages (19 big-head, 38 long-tail).

AI-Generated Language Distribution. 79,559 samples across 71 languages (20 big-head, 51 long-tail).

Hierarchical Language Classification. All 79 BLUFF languages organized by genetic relationship (12 families), script relationship (9 script types), and syntactic relationship (6 typologies).

Benchmark Tasks

BLUFF supports four complementary tasks spanning veracity classification and authorship detection, each evaluated using Macro-F1 to account for class imbalance across languages:

| Task | Description | Classes | Metric |
|---|---|---|---|
| Task 1 | Binary Veracity Classification | Real / Fake | Macro-F1 |
| Task 2 | Multi-class Veracity Classification | Real / Fake × Source Type (8 classes) | Macro-F1 |
| Task 3 | Binary Authorship Detection | Human / Machine | Macro-F1 |
| Task 4 | Multi-class Authorship Attribution | HWT / MGT / MTT / HAT | Macro-F1 |

Content types: HWT = Human-Written Text, MGT = Machine-Generated Text, MTT = Machine-Translated Text, HAT = Human-Adapted Text
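
Macro-F1 averages per-class F1 scores without frequency weighting, so minority classes in long-tail languages count as much as majority ones. With scikit-learn, the Task 4 metric would look like this (toy labels):

```python
from sklearn.metrics import f1_score

# Toy Task-4 labels; macro averaging weights all four classes equally.
y_true = ["HWT", "MGT", "MTT", "HAT", "HWT", "MTT"]
y_pred = ["HWT", "MGT", "HAT", "HAT", "MGT", "MTT"]
print(f1_score(y_true, y_pred, average="macro"))
```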

Key Results

Leaderboard (Multilingual Setting)

Top encoder model performance in the multilingual training setting, where models are trained and evaluated on all 79 languages simultaneously:

| Rank | Model | Task 1: Veracity (F1) | Task 3: Authorship (F1) | Task 4: Attribution (F1) |
|---|---|---|---|---|
| 1 | S-BERT (LaBSE) | 97.2 | 93.2 | 82.0 |
| 2 | mDeBERTa-v3 | 98.3* | 87.3 | 80.6 |
| 3 | XLM-R-large | 84.7 | 87.3 | – |

*Big-head languages only. Full results including per-language breakdown, cross-lingual transfer, and decoder model evaluation are available in the GitHub repository.

Cross-Lingual Transfer by Language Family (10 models).

Cross-Lingual Transfer by Script Type (10 models).

Citation

Paper currently under review (2026, Datasets and Benchmarks Track). Citation will be provided upon acceptance.