
Introducing RELAI Leaderboard

Apr 22, 2025

Summary

🚀 RELAI Leaderboard

We evaluated popular LLMs on 100+ AI benchmarks generated with RELAI’s Data Agents on documentation from widely used software tools.

📌 Key Takeaways:

  • 🏆 Top model: o4-mini (among the models we evaluated), with an average score of 74.9% across benchmarks
  • 🔥 Most “known” tool: Express.js, with the top-3 models averaging 80.9% accuracy
  • 🧠 Toughest tool: Electron, with the top-3 models averaging just 56%

📅 Want to test your model? Book a slot here!

📊 Coming in our next post: How much better do retrieval or long-context models perform?

RELAI Data Agent - Recap

In the previous post, we introduced RELAI Data Agents: AI systems that convert your raw data into high-quality, complex, and grounded benchmarks, ready to be used for model evaluation and optimization. To give a sense of scale, Data Agents can generate 1,000 high-quality, reasoning-based samples from raw data in about an hour!

Do LLMs Know Software Engineering?

LLMs are widely used for software-related tasks. But how much do they really understand about the tools developers rely on? Do large language models really “know” tools like React, TensorFlow, Kubernetes, LangChain, and more?

To find out, we used RELAI’s Data Agents to create benchmarks from the official documentation of 50+ popular software tools, including React, TensorFlow, Kubernetes, Django, PyTorch, Node.js, LangChain, Google Cloud, Docker, scikit-learn, and others.

For each tool, we built a standard benchmark and a reasoning benchmark, in which the model must synthesize information from multiple sections of the docs to answer correctly. Each benchmark contains 1,000+ samples.

This has resulted in 100+ benchmarks with 100,000+ samples (many involving complex reasoning), all open-access on our platform and also available on Hugging Face.
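If you want to explore the benchmarks yourself, below is a minimal sketch of loading one with the Hugging Face `datasets` library. The dataset path and the `question`/`answer` field names are illustrative placeholders, not the exact identifiers; check the RELAI page on Hugging Face for the actual benchmark names and schema.

```python
from datasets import load_dataset

# Hypothetical dataset path and field names; see the RELAI page on
# Hugging Face for the real benchmark identifiers and schema.
bench = load_dataset("relai-ai/react-reasoning", split="test")

print(f"{len(bench)} samples")   # each benchmark contains 1,000+ samples
sample = bench[0]
print(sample["question"])        # question grounded in the official docs
print(sample["answer"])          # reference answer used for scoring
```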

RELAI Leaderboard

We then evaluated several state-of-the-art foundation models, without retrieval, to test how well they understand these tools from their weights alone. The tested models include GPT-4o, o4-mini, LLaMA 3.3 70B, Gemini 2.0 Flash, and Grok 2.

For evaluations, we use RELAI Evaluation Agents. These agents provide more accurate, objective model evaluations than subjective LLM-as-a-judge scoring. Figure 1 shows a screenshot of the RELAI Leaderboard.


Figure 1. A snapshot of the RELAI Leaderboard. For a live view, visit relai.ai.
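To make the setup concrete, here is a minimal sketch of the kind of no-retrieval evaluation loop described above: the model sees only the benchmark question (no documentation snippets), and its answer is scored with a naive exact-match check. This is not RELAI’s Evaluation Agent, which scores answers far more robustly; the OpenAI-style API call and the `question`/`answer` fields are the same placeholders as in the loading snippet.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, model: str = "gpt-4o") -> str:
    """Ask the model the benchmark question alone, with no retrieved context."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content.strip()

def accuracy(bench, model: str = "gpt-4o") -> float:
    """Naive exact-match accuracy over a benchmark split; a stand-in for
    the more robust scoring done by RELAI Evaluation Agents."""
    correct = sum(
        answer(s["question"], model).strip().lower() == s["answer"].strip().lower()
        for s in bench
    )
    return correct / len(bench)
```

In practice, exact match is too brittle for free-form answers, which is exactly why we score with Evaluation Agents rather than string comparison or LLM-as-a-judge.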

Results

Figure 2 illustrates the average performance of various models across a broad set of benchmarks, broken out into categories such as Frontend, Backend, AI/ML, and DevOps. 📊 In total, around 100 AI-generated benchmarks, produced by RELAI’s Data Agents, were evaluated. The top-performing model was o4-mini, with an average accuracy of 74.9%, followed by Grok 2 at 69.2%. While models performed relatively well on tools like Express.js (the top three models averaged 80.9% accuracy), the DevOps category emerged as the most challenging, with the top three models scoring only 67.3% on average. Notably, Electron stood out as the toughest individual tool, with leading models achieving just 56% accuracy (Figure 3).


Figure 2. Performance of various LLMs across a broad set of benchmarks.


Figure 3. Performance of various LLMs on benchmarks generated for various software tools.