LiveSearchBench

LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge

Heng Zhou1,2,*, Ao Yu1,*, Yuchen Fan1,3,*, Jianing Shi4, Li Kang1,3

Hejia Geng8, Yongting Zhang1, Yutao Fan2,6, Yuhao Wu7, Tiancheng He5

Yiran Qin2, Lei Bai2,†, Zhenfei Yin8,†

1University of Science and Technology of China    2Shanghai AI Laboratory

3Shanghai Jiao Tong University    4London School of Economics    5BUPT

6Harbin Institute of Technology    7SUTD    8University of Oxford

*Equal contribution    †Corresponding author

A continually updated benchmark that shifts evaluation from static memorization toward tasks requiring up-to-date retrieval and reasoning

Pipeline Overview

[Figure: LiveSearchBench pipeline]

Benchmark Statistics

  • 600 total questions, across the 2021 and 2025 batches
  • 3 difficulty levels: L1 single-hop, L2 multi-constraint, L3 multi-hop + fuzzing
  • 100% SPARQL-validated: every question has a unique, verifiable answer
  • 52.9% average performance drop from the 2021 batch to the 2025 batch

Note: The 2025 batch represents genuinely novel knowledge that post-dates model training, while the 2021 batch may overlap with pretraining data. The dramatic performance differences highlight the challenge of dynamic knowledge evaluation.

Key Experimental Results

📈 Retrieval vs. No Retrieval

  • 2021 batch: +137.8% relative improvement with retrieval
  • 2025 batch: +308.3% relative improvement with retrieval

Retrieval shows dramatically higher relative improvements on the novel 2025 knowledge.

🎯 Difficulty Level Analysis

  • Level 1 (single-hop): highest accuracy
  • Level 2 (multi-constraint): moderate difficulty
  • Level 3 (multi-hop + fuzzing): most challenging

Performance typically declines from L1 to L3, reflecting greater sensitivity to retrieval precision.


About

🔑 Key Findings

  • Pronounced performance drop when models confront facts that post-date pretraining
  • Gap most salient on multi-hop queries requiring complex reasoning
  • Retrieval-augmented methods provide partial gains but fail to close the recency gap
  • Parametric-only inference often matches retrieval on static datasets due to memorization

LiveSearchBench Pipeline

📊 1. Differential Knowledge Extraction

Compute deltas between successive Wikidata snapshots (T₀ → T₁) to identify new and updated facts.
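As a rough illustration of the delta step, the sketch below treats each snapshot as a set of (subject, predicate, object) triples and takes a set difference; the function and the toy triples are assumptions for exposition, not the released pipeline code.

# Minimal sketch of differential knowledge extraction between two snapshots.
# Each snapshot is modeled as a set of (subject, predicate, object) triples;
# the real pipeline works over full Wikidata dumps, but the core operation
# is this set difference.
Triple = tuple[str, str, str]

def extract_delta(snapshot_t0: set[Triple], snapshot_t1: set[Triple]) -> set[Triple]:
    """Return triples present at T1 but not at T0 (new or updated facts)."""
    return snapshot_t1 - snapshot_t0

# Toy example: the ICLR 2026 host country appears only in the newer snapshot.
t0 = {("ICLR2026", "instance_of", "academic_conference")}
t1 = {("ICLR2026", "instance_of", "academic_conference"),
      ("ICLR2026", "country", "Brazil")}
print(extract_delta(t0, t1))  # {('ICLR2026', 'country', 'Brazil')}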

🔍 2. Candidate Filtering

Apply relation allow-lists, entity quality checks, and statement validity filters.
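The allow-list and quality checks below are placeholders meant only to illustrate the shape of this filtering stage; the actual relation lists and heuristics are not specified here.

# Illustrative candidate filtering over delta triples. The allow-list and the
# quality heuristic are assumed examples, not the benchmark's actual filters.
RELATION_ALLOW_LIST = {"country", "member_of_sports_team"}  # assumed example relations

def is_quality_entity(entity: str) -> bool:
    """Placeholder check: reject empty strings and blank nodes."""
    return bool(entity) and not entity.startswith("_:")

def filter_candidates(delta):
    """Keep triples whose relation is allow-listed and whose entities pass the checks."""
    return [(s, p, o) for (s, p, o) in delta
            if p in RELATION_ALLOW_LIST
            and is_quality_entity(s) and is_quality_entity(o)]

print(filter_candidates([("ICLR2026", "country", "Brazil"),
                         ("ICLR2026", "logo_image", "iclr2026.svg")]))
# -> [('ICLR2026', 'country', 'Brazil')]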

🎯 3. Hierarchical Question Synthesis

Generate L1 (single-hop), L2 (multi-constraint), and L3 (multi-hop + fuzzing) questions.
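The templates below are an assumed simplification showing how a validated triple (or a set of triples sharing an answer entity) could be turned into L1, L2, and L3 questions of increasing difficulty; the paper's actual generation and fuzzing procedures are not reproduced here.

# Illustrative hierarchical question synthesis. Templates are assumptions for
# exposition only.
def synthesize_l1(triple):
    """L1: single-hop question from one triple; the object is the answer."""
    s, p, o = triple
    return f"What is the {p} of {s}?", o

def synthesize_l2(triples):
    """L2: multi-constraint question; all triples share the same answer entity."""
    answer = triples[0][2]
    constraints = " and ".join(f"the {p} of {s}" for s, p, _ in triples)
    return f"Which entity is {constraints}?", answer

def synthesize_l3(triples, fuzzed_entity, fuzzy_description):
    """L3: like L2, but one constraint entity is replaced by a fuzzier
    description, adding an extra reasoning hop."""
    question, answer = synthesize_l2(triples)
    return question.replace(fuzzed_entity, fuzzy_description), answer

print(synthesize_l1(("ICLR2026", "country", "Brazil")))
# -> ('What is the country of ICLR2026?', 'Brazil')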

4. Finalization & Validation

SPARQL validation ensures that each question admits exactly one verifiable answer.
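A minimal sketch of this uniqueness check, assuming each question is backed by a SPARQL query run against the public Wikidata endpoint; the endpoint usage, the example query, and the placeholder entity ID are assumptions rather than the benchmark's exact validation code.

# Sketch of SPARQL-based validation: keep a question only if its backing query
# returns exactly one result on the Wikidata Query Service.
import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def has_unique_answer(sparql_query: str) -> bool:
    """Return True iff the query yields exactly one binding."""
    resp = requests.get(
        WIKIDATA_ENDPOINT,
        params={"query": sparql_query, "format": "json"},
        headers={"User-Agent": "livesearchbench-demo/0.1"},  # polite UA for the endpoint
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return len(bindings) == 1

# Illustrative query; wd:Q000000 is a placeholder item ID, wdt:P17 is "country".
example_query = "SELECT ?country WHERE { wd:Q000000 wdt:P17 ?country . }"
# has_unique_answer(example_query)  # issues a live HTTP request when run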

Question Complexity Examples

Level 1: Single-Hop

Q: "In which country will the ICLR2026 conference be held?"

A: Brazil

Based on the triple (ICLR2026, country, Brazil).

Level 2: Multi-Constraint

Q: "Which football player has played for Real Madrid, Juventus, and Al Nassr?"

A: Cristiano Ronaldo

Requires intersecting multiple constraints, based on the triples (Real Madrid, player, Cristiano Ronaldo), (Juventus, player, Cristiano Ronaldo), and (Al Nassr, player, Cristiano Ronaldo).
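For concreteness, such a multi-constraint question corresponds to a conjunctive SPARQL pattern over the shared answer entity; the query below uses Wikidata's "member of sports team" property (P54) and placeholder club IDs, which are assumptions rather than the benchmark's actual query.

# A Level 2 question expressed as a conjunctive SPARQL query: the answer must
# satisfy every club-membership constraint at once. The wd:Q_* IDs below are
# placeholders, not verified Wikidata item IDs.
multi_constraint_query = """
SELECT ?player WHERE {
  ?player wdt:P54 wd:Q_REAL_MADRID .
  ?player wdt:P54 wd:Q_JUVENTUS .
  ?player wdt:P54 wd:Q_AL_NASSR .
}
"""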

Level 3: Multi-Hop + Attribute Fuzzing

Q: "Which football player has played for Real Madrid, Juventus, and a Saudi Arabian club?"

A: Cristiano Ronaldo

Fuzzes "Al Nassr" → "a Saudi Arabian club" and adds an extra reasoning hop.

Research Contributions

🔄

Scalable Data Generation

Automated pipeline that continuously harvests questions from real-world editing streams with temporal correctness validation

📊

Comprehensive Evaluation

Extensive evaluation of state-of-the-art LLMs and RAG methods revealing strengths and limitations in dynamic knowledge handling

🌐

Community Resource

Continually updating benchmark enabling the community to track progress on retrieval-augmented methods under realistic conditions

Our experiments demonstrate that current LLMs face significant challenges when confronting knowledge that post-dates their training. The performance gap is most pronounced on multi-hop reasoning tasks, where models must integrate multiple pieces of recent information. This finding highlights the critical need for benchmarks that reflect the dynamic nature of real-world knowledge and adequately test models' ability to retrieve and reason over up-to-date information.

Leaderboard

2025 Batch: Novel Knowledge

Large Language Models

Rank Model Level 1 Level 2 Level 3 Average

Small Models with Reinforcement Learning

Rank Model Level 1 Level 2 Level 3 Average

BibTeX

@article{zhou2025livesearchbench,
  title={LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge},
  author={Zhou, Heng and Yu, Ao and Fan, Yuchen and Shi, Jianing and Kang, Li and Geng, Hejia and Zhang, Yongting and Fan, Yutao and Wu, Yuhao and He, Tiancheng and Qin, Yiran and Bai, Lei and Yin, Zhenfei},
  journal={arXiv preprint arXiv:2025.xxxxx},
  year={2025}
}