Maestro - The Holistic Optimizer for AI Agents

Agentic AI is revolutionizing industries—from smart assistants and HR automation to summarization and IT ticketing—yet real‑world deployments still struggle with hallucinations, tool misuse, stochastic behavior and inconsistent I/O. Maestro solves this by optimizing your agent’s entire execution graph—nodes, relationships and state—alongside surface‑level tuning, all tailored to your data and objectives. The result? Agents that perform with consistent accuracy and resilience in production.
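To make the idea concrete, here is a minimal, purely illustrative sketch of the kind of execution graph a holistic optimizer works over: nodes with prompts and tools, edges describing their relationships, and shared state. The class and field names are hypothetical and are not the relai or Maestro API.

# Illustrative only: a toy model of an agent execution graph.
# Class and field names are hypothetical, not the relai/Maestro API.
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    name: str                      # e.g. "retrieve", "summarize", "reply"
    prompt: str                    # node-level instructions an optimizer could tune
    tools: list[str] = field(default_factory=list)

@dataclass
class AgentGraph:
    nodes: dict[str, AgentNode] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)  # (from, to) relationships
    state: dict[str, object] = field(default_factory=dict)      # shared state passed between nodes

    def add_node(self, node: AgentNode) -> None:
        self.nodes[node.name] = node

    def connect(self, src: str, dst: str) -> None:
        self.edges.append((src, dst))

# A holistic optimizer operates on all three pieces at once:
# node prompts/tools, the edge structure, and how state flows through it.
graph = AgentGraph()
graph.add_node(AgentNode("retrieve", "Fetch relevant tickets", tools=["search_tickets"]))
graph.add_node(AgentNode("reply", "Draft a grounded answer"))
graph.connect("retrieve", "reply")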


Critico Agents – Quality Assessment of AI Agents

Evaluating agentic solutions requires domain‑tailored benchmarks, rich datasets, and powerful evaluators that can measure correctness, completeness, hallucinations, style, and format. Critico unifies your data assets and benchmark suites with a library of customizable evaluation functions—whether you use our RAG, hallucination, completeness, or format/style evaluators, or plug in your own—so you can quantify strengths, diagnose weaknesses, and iterate more effectively.

from relai.critico import Critico

# Initialize Critico with the client
critico = Critico(client=client)

# Add evaluators
critico.add_evaluators({PriceFormatEvaluator(): 1.0})

# Evaluate agent logs
critico_logs = await critico.evaluate(agent_logs)
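
The snippet above registers a built-in evaluator; the description also mentions plugging in your own. Since the evaluator interface isn't shown on this page, the following sketch assumes a small class with an evaluate method, registered with a numeric value the same way PriceFormatEvaluator is; the class name, method, and the meaning of that number are assumptions rather than the documented relai API.

# Hypothetical custom evaluator; the interface Critico expects is assumed here,
# not taken from relai documentation.
class CitationPresenceEvaluator:
    """Scores 1.0 if an agent answer cites at least one source, else 0.0."""

    def evaluate(self, log: dict) -> float:
        answer = log.get("answer", "")
        return 1.0 if "[source:" in answer else 0.0

# Registered alongside the built-ins, following the same dict pattern as above.
critico.add_evaluators({CitationPresenceEvaluator(): 0.5})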

Agent Simulator — Create Flexible Simulation Environments for Agents

Simulating multi-turn agentic conversations—wired to tools and MCP servers—and capturing execution traces is time-intensive. Agent Simulator spins up configurable LLM personas, mock MCP servers/tools, and production-like data to generate rich interaction logs in minutes. With conditional simulations that mirror production states, it reproduces real-world variability while staying controllable and replayable—so you can stress-test flows and optimize agents before they reach production.
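As a rough illustration of the workflow, and not the Agent Simulator API, the sketch below wires a scripted persona to a mock tool, runs a short multi-turn exchange, and captures each step as a replayable log entry; every name in it is hypothetical.

# Conceptual sketch only: a persona-driven, multi-turn simulation that produces
# replayable interaction logs. None of these names come from the relai Agent Simulator API.
import json

def mock_ticket_tool(query: str) -> str:
    """Stand-in for a real MCP tool/server, returning canned, production-like data."""
    return json.dumps({"ticket_id": 4231, "status": "open", "summary": query[:40]})

def persona_turn(persona: str, turn: int) -> str:
    """A scripted persona; in a real simulator this would be an LLM playing the user."""
    scripts = {
        "frustrated_user": ["My VPN is down again.", "Still broken, please escalate."],
        "new_hire": ["How do I request a laptop?", "Thanks. How long does approval take?"],
    }
    lines = scripts[persona]
    return lines[turn % len(lines)]

def simulate(persona: str, turns: int) -> list[dict]:
    """Run a deterministic multi-turn exchange and capture every step as a log entry."""
    log = []
    for t in range(turns):
        user_msg = persona_turn(persona, t)
        tool_result = mock_ticket_tool(user_msg)
        log.append({"turn": t, "persona": persona, "user": user_msg, "tool": tool_result})
    return log

simulated_logs = simulate("frustrated_user", turns=2)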

Data Agents – Automated Benchmark Creation

Hand‑crafting application‑specific benchmarks takes months, and public datasets often miss the mark for your domain. Data Agents automate this entire process: they ingest your raw data and instructions, then generate complex, grounded reasoning benchmarks and annotated samples. To date, our Data Agents have produced over 100 benchmarks and 100,000 evaluation samples, empowering you to validate and refine RAG systems, agentic RAGs, and beyond. The data is available on Hugging Face.
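For a sense of what a grounded, reasoning benchmark sample could look like as a record, here is an illustrative sketch with a question, the source passage it is grounded in, a reference answer, and an annotation field; the schema is an assumption, not the actual format of the published datasets.

# Illustrative benchmark sample; field names are hypothetical, not the
# actual schema of RELAI Data Agents datasets.
from dataclasses import dataclass

@dataclass
class BenchmarkSample:
    question: str          # reasoning question generated from your raw data
    grounding: str         # source passage the answer must be grounded in
    reference_answer: str  # annotated gold answer
    difficulty: str        # e.g. "single-hop" vs "multi-hop"

sample = BenchmarkSample(
    question="Which refund policy applies to orders placed before the May update?",
    grounding="Policy v2 (effective May 1) supersedes v1 for new orders only...",
    reference_answer="Policy v1 still governs orders placed before May 1.",
    difficulty="multi-hop",
)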

RELAI Leaderboard

The leaderboard shows the performance of popular large language models on benchmarks generated by public Data Agents.

Model | Avg Score