Evals are unit tests for your Agents and Teams. Use them judiciously to measure and improve their performance. Agno provides 3 dimensions for evaluating Agents:

  • Accuracy: How complete/correct/accurate is the Agent's response (LLM-as-a-judge)?
  • Performance: How fast does the Agent respond, and what is its memory footprint?
  • Reliability: Does the Agent make the expected tool calls?

Accuracy

Accuracy evals use input/output pairs to measure the performance of your Agents and Teams against a gold-standard answer. A larger model is used to score the Agent's responses (LLM-as-a-judge).

Example

In this example, the AccuracyEval will run the Agent with the input, then use a larger model (o4-mini) to score the Agent's response against the provided guidelines.

calculate_accuracy.py
from typing import Optional
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

evaluation = AccuracyEval(
    model=OpenAIChat(id="o4-mini"),
    agent=Agent(model=OpenAIChat(id="gpt-4o"), tools=[CalculatorTools(enable_all=True)]),
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    additional_guidelines="Agent output should include the steps and the final answer.",
)

result: Optional[AccuracyResult] = evaluation.run(print_results=True)
assert result is not None and result.avg_score >= 8
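
The judge model returns a numeric score for each run (the avg_score >= 8 assertion above assumes a 10-point scale, averaged across num_iterations runs), and run() can return None, which is why the result is checked before asserting on the score.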

You can also run an AccuracyEval on an existing output (without running the Agent):

accuracy_eval_with_output.py
from typing import Optional

from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat

evaluation = AccuracyEval(
    model=OpenAIChat(id="o4-mini"),
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    num_iterations=1,
)
result_with_given_answer: Optional[AccuracyResult] = evaluation.run_with_output(
    output="2500", print_results=True
)
assert result_with_given_answer is not None and result_with_given_answer.avg_score >= 8
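
Accuracy evals also work with Teams. Below is a minimal sketch, assuming AccuracyEval accepts a team argument in place of agent (check this against your Agno version); the single-member team is purely illustrative.

team_accuracy.py
from typing import Optional

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat
from agno.team import Team
from agno.tools.calculator import CalculatorTools

# A single-member team, just enough to illustrate the eval wiring
calculator_team = Team(
    members=[
        Agent(
            name="Calculator",
            model=OpenAIChat(id="gpt-4o"),
            tools=[CalculatorTools(enable_all=True)],
        )
    ],
    model=OpenAIChat(id="gpt-4o"),
)

evaluation = AccuracyEval(
    model=OpenAIChat(id="o4-mini"),
    team=calculator_team,  # assumed: a Team where the earlier examples pass an Agent
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
)

result: Optional[AccuracyResult] = evaluation.run(print_results=True)
assert result is not None and result.avg_score >= 8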

Performance

Performance evals measure the latency and memory footprint of an Agent or Team.

While latency is mostly determined by the model API's response time, it is still worth keeping an eye on performance and tracking how an Agent or Team performs with and without certain components. For example: it is useful to know the average latency with and without storage or memory, with a new prompt, or with a new model (see the comparison sketch after the example below).

Example

storage_performance.py
"""Run `pip install openai agno` to install dependencies."""

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.eval.perf import PerfEval

def simple_response():
    agent = Agent(model=OpenAIChat(id='gpt-4o-mini'), system_message='Be concise, reply with one sentence.', add_history_to_messages=True)
    response_1 = agent.run('What is the capital of France?')
    print(response_1.content)
    response_2 = agent.run('How many people live there?')
    print(response_2.content)
    return response_2.content


simple_response_perf = PerfEval(func=simple_response, num_iterations=1, warmup_runs=0)

if __name__ == "__main__":
    simple_response_perf.run(print_results=True)
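
To act on the comparison idea above, run the same PerfEval workload against two configurations and compare the printed results. A minimal sketch, where both functions run the same two turns and differ only in add_history_to_messages (file and function names are illustrative):

compare_history_perf.py
"""Run `pip install openai agno` to install dependencies."""

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.eval.perf import PerfEval

def two_turns_with_history():
    agent = Agent(model=OpenAIChat(id='gpt-4o-mini'), system_message='Be concise, reply with one sentence.', add_history_to_messages=True)
    agent.run('What is the capital of France?')
    return agent.run('How many people live there?').content

def two_turns_without_history():
    agent = Agent(model=OpenAIChat(id='gpt-4o-mini'), system_message='Be concise, reply with one sentence.')
    agent.run('What is the capital of France?')
    return agent.run('How many people live there?').content

if __name__ == "__main__":
    # Same workload, one variable changed: compare the two printed summaries
    PerfEval(func=two_turns_with_history, num_iterations=3, warmup_runs=1).run(print_results=True)
    PerfEval(func=two_turns_without_history, num_iterations=3, warmup_runs=1).run(print_results=True)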

Reliability

What makes an Agent or Team reliable?

  • Does it make the expected tool calls?
  • Does it handle errors gracefully?
  • Does it respect the model API's rate limits?

Example

The first check is to ensure the Agent makes the expected tool calls. Here's an example:

reliability.py
from typing import Optional

from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval, ReliabilityResult
from agno.tools.calculator import CalculatorTools
from agno.models.openai import OpenAIChat
from agno.run.response import RunResponse


def multiply_and_exponentiate():
    agent = Agent(
        model=OpenAIChat(id="gpt-4o-mini"),
        tools=[CalculatorTools(add=True, multiply=True, exponentiate=True)],
    )
    response: RunResponse = agent.run("What is 10*5 then to the power of 2? do it step by step")
    evaluation = ReliabilityEval(
        agent_response=response,
        expected_tool_calls=["multiply", "exponentiate"],
    )
    result: Optional[ReliabilityResult] = evaluation.run(print_results=True)
    assert result is not None
    result.assert_passed()


if __name__ == "__main__":
    multiply_and_exponentiate()
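
assert_passed() raises when an expected tool call is missing, so a function like this can be dropped into a pytest suite as-is.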

Reliability evals are currently in beta.