Data Agent Benchmark for Multi-step Reasoning (🕺 DABStep)

Alex Egg (Adyen), Martin Iglesias (Adyen), Friso Kingma (Adyen), Andreu Mora (Adyen), Leandro Von Werra (HuggingFace), Thomas Wolf (HuggingFace)

February 4, 2025

👉🏽 You can access DABstep here: https://huggingface.co/spaces/adyen/DABstep

Language models are becoming increasingly capable and can solve tasks autonomously as agents. There are many exciting use cases, especially at the intersection of reasoning, code, and data. However, rigorous evaluation benchmarks grounded in real-world problems are lacking, which hinders progress in the field.

To tackle this challenge, Adyen and Hugging Face jointly built the Data Agent Benchmark for Multi-step Reasoning (DABstep). DABstep consists of over 450 data analysis tasks designed to evaluate the capabilities of state-of-the-art LLMs and AI agents.

Our findings reveal that DABstep presents a significant challenge for current AI models: even the most capable reasoning-based agents achieve only 16% accuracy, highlighting how much progress remains to be made in the field.

DABstep requires AI models to:

  • dive into the details of data and be rigorous (no hallucinations)

  • reason over free-form text and structured databases

  • connect with real-life use cases (not just math or code)
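To make the evaluation concrete, here is a minimal sketch of what a task record and an exact-match scorer for a benchmark like this could look like. The field names (`task_id`, `question`, `answer`) and the sample tasks are purely illustrative assumptions, not DABstep's actual schema or data.

```python
# Hedged sketch: a DABstep-style task list and a simple exact-match scorer.
# Field names and sample tasks are illustrative assumptions only.

def score(predictions: dict[str, str], tasks: list[dict]) -> float:
    """Return the fraction of tasks whose predicted answer exactly
    matches the reference answer (case- and whitespace-insensitive)."""
    correct = sum(
        1
        for t in tasks
        if predictions.get(t["task_id"], "").strip().lower()
        == t["answer"].strip().lower()
    )
    return correct / len(tasks)

# Hypothetical tasks and agent predictions for illustration.
tasks = [
    {"task_id": "1", "question": "Which channel has the highest fraud rate?",
     "answer": "ecommerce"},
    {"task_id": "2", "question": "What is the average transaction value?",
     "answer": "91.85"},
]
predictions = {"1": "Ecommerce", "2": "42.00"}
print(score(predictions, tasks))  # → 0.5
```

An agent must produce a single free-form answer per task, so a strict exact-match metric like this leaves no room for hallucinated or approximately correct responses.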

In this blog post, we’ll cover the design and construction of the benchmark, explore evaluation results, and discuss the significant gap between current models and the ability to solve complex data analysis tasks effectively.
