LiveSearchBench

LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge

Heng Zhou1,2,*, Ao Yu1,*, Yuchen Fan1,3,*, Jianing Shi4, Li Kang1,3

Hejia Geng8, Yongting Zhang1, Yutao Fan2,6, Yuhao Wu7, Tiancheng He5

Yiran Qin2, Lei Bai2,†, Zhenfei Yin8,†

1University of Science and Technology of China    2Shanghai AI Laboratory

3Shanghai Jiao Tong University    4London School of Economics    5BUPT

6Harbin Institute of Technology    7SUTD    8University of Oxford

*Equal contribution    †Corresponding author

A continually updated benchmark that shifts evaluation from static memorization toward tasks requiring up-to-date retrieval and reasoning

Pipeline Overview

[Figure: LiveSearchBench pipeline]

Benchmark Statistics

  • 600 total questions, across the 2021 and 2025 batches
  • 3 difficulty levels: L1 single-hop, L2 multi-constraint, L3 multi-hop + fuzzing
  • 100% SPARQL-validated: every question has a unique, verifiable answer
  • 52.9% average performance drop from the 2021 batch to the 2025 batch

Note: The 2025 batch represents genuinely novel knowledge that post-dates model training, while the 2021 batch may overlap with pretraining data. The dramatic performance differences highlight the challenge of dynamic knowledge evaluation.

Key Experimental Results

📈 Retrieval vs. No Retrieval

  • 2021 batch: +137.8% relative improvement with retrieval
  • 2025 batch: +308.3% relative improvement with retrieval

Retrieval shows dramatically higher relative improvements on the novel 2025 knowledge.

🎯 Difficulty Level Analysis

  • Level 1 (single-hop): highest accuracy
  • Level 2 (multi-constraint): moderate difficulty
  • Level 3 (multi-hop + fuzzing): most challenging

Performance typically declines from L1 to L3, reflecting greater sensitivity to retrieval precision.


About

🔑 Key Findings

  • Pronounced performance drop when models confront facts that post-date pretraining
  • Gap most salient on multi-hop queries requiring complex reasoning
  • Retrieval-augmented methods provide partial gains but fail to close the recency gap
  • Parametric-only inference often matches retrieval on static datasets due to memorization

LiveSearchBench Pipeline

📊 1. Differential Knowledge Extraction

Compute deltas between successive Wikidata snapshots (T₀ → T₁) to identify new and updated facts.
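As a rough illustration of the delta step, the sketch below treats each snapshot as a set of (subject, predicate, object) triples and takes a set difference; the function and the toy triples are assumptions for exposition, not the released pipeline code.

# Minimal sketch of differential knowledge extraction between two snapshots.
# Each snapshot is modeled as a set of (subject, predicate, object) triples;
# the real pipeline works over full Wikidata dumps, but the core operation
# is this set difference.
Triple = tuple[str, str, str]

def extract_delta(snapshot_t0: set[Triple], snapshot_t1: set[Triple]) -> set[Triple]:
    """Return triples present at T1 but not at T0 (new or updated facts)."""
    return snapshot_t1 - snapshot_t0

# Toy example: the ICLR 2026 host country appears only in the newer snapshot.
t0 = {("ICLR2026", "instance_of", "academic_conference")}
t1 = {("ICLR2026", "instance_of", "academic_conference"),
      ("ICLR2026", "country", "Brazil")}
print(extract_delta(t0, t1))  # {('ICLR2026', 'country', 'Brazil')}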

🔍 2. Candidate Filtering

Apply relation allow-lists, entity quality checks, and statement validity filters.
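The allow-list and quality checks below are placeholders meant only to illustrate the shape of this filtering stage; the actual relation lists and heuristics are not specified here.

# Illustrative candidate filtering over delta triples. The allow-list and the
# quality heuristic are assumed examples, not the benchmark's actual filters.
RELATION_ALLOW_LIST = {"country", "member_of_sports_team"}  # assumed example relations

def is_quality_entity(entity: str) -> bool:
    """Placeholder check: reject empty strings and blank nodes."""
    return bool(entity) and not entity.startswith("_:")

def filter_candidates(delta):
    """Keep triples whose relation is allow-listed and whose entities pass the checks."""
    return [(s, p, o) for (s, p, o) in delta
            if p in RELATION_ALLOW_LIST
            and is_quality_entity(s) and is_quality_entity(o)]

print(filter_candidates([("ICLR2026", "country", "Brazil"),
                         ("ICLR2026", "logo_image", "iclr2026.svg")]))
# -> [('ICLR2026', 'country', 'Brazil')]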

🎯 3. Hierarchical Question Synthesis

Generate L1 (single-hop), L2 (multi-constraint), and L3 (multi-hop + fuzzing) questions.
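The templates below are an assumed simplification showing how a validated triple (or a set of triples sharing an answer entity) could be turned into L1, L2, and L3 questions of increasing difficulty; the paper's actual generation and fuzzing procedures are not reproduced here.

# Illustrative hierarchical question synthesis. Templates are assumptions for
# exposition only.
def synthesize_l1(triple):
    """L1: single-hop question from one triple; the object is the answer."""
    s, p, o = triple
    return f"What is the {p} of {s}?", o

def synthesize_l2(triples):
    """L2: multi-constraint question; all triples share the same answer entity."""
    answer = triples[0][2]
    constraints = " and ".join(f"the {p} of {s}" for s, p, _ in triples)
    return f"Which entity is {constraints}?", answer

def synthesize_l3(triples, fuzzed_entity, fuzzy_description):
    """L3: like L2, but one constraint entity is replaced by a fuzzier
    description, adding an extra reasoning hop."""
    question, answer = synthesize_l2(triples)
    return question.replace(fuzzed_entity, fuzzy_description), answer

print(synthesize_l1(("ICLR2026", "country", "Brazil")))
# -> ('What is the country of ICLR2026?', 'Brazil')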

4. Finalization & Validation

SPARQL validation ensures that each question admits exactly one verifiable answer.
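A minimal sketch of this uniqueness check, assuming each question is backed by a SPARQL query run against the public Wikidata endpoint; the endpoint usage, the example query, and the placeholder entity ID are assumptions rather than the benchmark's exact validation code.

# Sketch of SPARQL-based validation: keep a question only if its backing query
# returns exactly one result on the Wikidata Query Service.
import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def has_unique_answer(sparql_query: str) -> bool:
    """Return True iff the query yields exactly one binding."""
    resp = requests.get(
        WIKIDATA_ENDPOINT,
        params={"query": sparql_query, "format": "json"},
        headers={"User-Agent": "livesearchbench-demo/0.1"},  # polite UA for the endpoint
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return len(bindings) == 1

# Illustrative query; wd:Q000000 is a placeholder item ID, wdt:P17 is "country".
example_query = "SELECT ?country WHERE { wd:Q000000 wdt:P17 ?country . }"
# has_unique_answer(example_query)  # issues a live HTTP request when run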

Question Complexity Examples

Level 1: Single-Hop

Q: "In which country will the ICLR2026 conference be held?"

A: Brazil

Based on the triple (ICLR2026, country, Brazil).

Level 2: Multi-Constraint

Q: "Which football player has played for Real Madrid, Juventus, and Al Nassr?"

A: Cristiano Ronaldo

Requires intersecting multiple constraints, based on the triples (Real Madrid, player, Cristiano Ronaldo), (Juventus, player, Cristiano Ronaldo), and (Al Nassr, player, Cristiano Ronaldo).
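For concreteness, such a multi-constraint question corresponds to a conjunctive SPARQL pattern over the shared answer entity; the query below uses Wikidata's "member of sports team" property (P54) and placeholder club IDs, which are assumptions rather than the benchmark's actual query.

# A Level 2 question expressed as a conjunctive SPARQL query: the answer must
# satisfy every club-membership constraint at once. The wd:Q_* IDs below are
# placeholders, not verified Wikidata item IDs.
multi_constraint_query = """
SELECT ?player WHERE {
  ?player wdt:P54 wd:Q_REAL_MADRID .
  ?player wdt:P54 wd:Q_JUVENTUS .
  ?player wdt:P54 wd:Q_AL_NASSR .
}
"""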

Level 3: Multi-Hop + Attribute Fuzzing

Q: "Which football player has played for Real Madrid, Juventus, and a Saudi Arabian club?"

A: Cristiano Ronaldo

Fuzzes "Al Nassr" → "a Saudi Arabian club" and adds an extra reasoning hop.

Research Contributions

🔄

Scalable Data Generation

Automated pipeline that continuously harvests questions from real-world editing streams with temporal correctness validation

📊

Comprehensive Evaluation

Extensive evaluation of state-of-the-art LLMs and RAG methods revealing strengths and limitations in dynamic knowledge handling

🌐

Community Resource

Continually updating benchmark enabling the community to track progress on retrieval-augmented methods under realistic conditions

Our experiments demonstrate that current LLMs face significant challenges when confronting knowledge that post-dates their training. The performance gap is most pronounced on multi-hop reasoning tasks, where models must integrate multiple pieces of recent information. This finding highlights the critical need for benchmarks that reflect the dynamic nature of real-world knowledge and adequately test models' ability to retrieve and reason over up-to-date information.

Leaderboard

2025 Batch: Novel Knowledge

Large Language Models

Rank Model Level 1 Level 2 Level 3 Average

Small Models with Reinforcement Learning

Rank Model Level 1 Level 2 Level 3 Average

BibTeX

@article{zhou2025livesearchbench,
  title={LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge},
  author={Zhou, Heng and Yu, Ao and Fan, Yuchen and Shi, Jianing and Kang, Li and Geng, Hejia and Zhang, Yongting and Fan, Yutao and Wu, Yuhao and He, Tiancheng and Qin, Yiran and Bai, Lei and Yin, Zhenfei},
  journal={arXiv preprint arXiv:2025.xxxxx},
  year={2025}
}