
Evaluation

This guide explains how to evaluate SwarmForge runs by scoring runtime artifacts such as events, checkpoints, routing, tool calls, and final state. It does not cover general testing strategy beyond the SwarmForge evaluation helpers.

This guide is for developers who want to measure runtime behavior in SDK or FastAPI flows. It assumes that you already know how to run a swarm and collect the resulting artifacts.

After reading this guide, you should be able to:

  • build a graph_snapshot
  • create or select scenario seeds
  • score SDK and FastAPI runs with evaluate_scenario_artifacts(...)
  • aggregate results across scenarios

The evaluation package is transport-agnostic. It scores runtime artifacts such as event logs, checkpoints, routing traces, tool calls, and final state, so the same evaluation flow works for:

  • direct SDK runs through process_swarm_stream(...)
  • FastAPI runs through create_swarm_app(...)
  • FastAPI runs through create_fastapi_app(...)

The core implementation lives under src/swarmforge/evaluation/, with graph and scoring helpers in src/swarmforge/evaluation/swarm.py and trace/conversation helpers in src/swarmforge/evaluation/runner/.

Evaluation overview

What evaluation works on

Evaluation does not depend on a specific UI or HTTP layer. It works on artifacts that every runtime path can produce:

  • a graph_snapshot
  • a scenario_seed
  • an event_log
  • a list of SessionCheckpoint records

That makes evaluation useful for:

  • local SDK regression tests
  • FastAPI endpoint tests
  • CI scoring after a run
  • comparing runtime behavior across providers
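For CI scoring, a run's result is typically gated on a threshold. A minimal sketch, assuming only a result dict with an `overall_score` field like the one returned by evaluate_scenario_artifacts(...); the helper name and threshold are illustrative, not part of the library:

```python
def assert_run_quality(result, min_score=0.8):
    # Fail fast in CI when a scored run drops below the threshold.
    if result["overall_score"] < min_score:
        raise AssertionError(
            f"run scored {result['overall_score']:.2f}, below {min_score:.2f}"
        )

assert_run_quality({"overall_score": 0.92})  # passes silently
```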

Core helpers

The main helper surface is:

  • build_graph_snapshot(...)
  • build_heuristic_swarm_intents(...)
  • build_intent_based_swarm_scenario_seeds(...)
  • classify_scenario_feasibility(...)
  • evaluate_scenario_artifacts(...)
  • aggregate_scenario_results(...)

ConversationRunner is available separately when you want model-backed multi-turn trace capture outside the swarm runtime itself.

SDK evaluation tutorial

Use the SDK path when you already run swarms directly in Python and want to score the exact artifacts returned by process_swarm_stream(...).

SDK evaluation flow

The shortest SDK evaluation flow is:

  1. build a SwarmDefinition
  2. convert it into a graph_snapshot
  3. derive scenario seeds
  4. run the swarm through process_swarm_stream(...)
  5. collect the emitted events and checkpoints
  6. score those artifacts

Build evaluation inputs

Start from a runtime swarm definition, then create a snapshot and scenario seed:

python
from swarmforge.authoring import build_swarm_definition
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    classify_scenario_feasibility,
)

swarm = build_swarm_definition(
    {
        "nodes": [
            {
                "node_key": "triage",
                "name": "Triage",
                "persona": "",
                "is_entry_node": True,
            },
            {
                "node_key": "billing",
                "name": "Billing",
                "persona": "",
                "is_entry_node": False,
            },
        ],
        "edges": [
            {
                "source_node_key": "triage",
                "target_node_key": "billing",
                "handoff_description": "Transfer after confirming the request is billing-related.",
                "required_variables": ["account_id"],
            }
        ],
        "variables": [
            {
                "key_name": "account_id",
                "description": "Customer account identifier",
                "reducer_rule": "overwrite",
            }
        ],
    },
    swarm_id="support",
    name="Support Swarm",
)

graph_snapshot = build_graph_snapshot(swarm)
intents = build_heuristic_swarm_intents(graph_snapshot, 2)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in intents],
    1,
    2,
)[0]

feasibility = classify_scenario_feasibility(graph_snapshot, scenario_seed)
print(feasibility["classification"])

For this swarm, feasibility is usually conditionally_feasible because the triage -> billing handoff depends on account_id.
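A rule of this shape can be illustrated in plain Python. The sketch below is an assumption for illustration, not classify_scenario_feasibility(...)'s actual implementation: it treats an edge whose required variables are not seeded up front as making the scenario conditionally feasible.

```python
def classify_feasibility_sketch(edges, seeded_variables):
    # Collect every required variable that the scenario does not supply
    # up front; the run must first gather those before the handoff fires.
    missing = set()
    for edge in edges:
        for var in edge.get("required_variables", []):
            if var not in seeded_variables:
                missing.add(var)
    if not missing:
        return {"classification": "feasible", "missing": []}
    return {"classification": "conditionally_feasible", "missing": sorted(missing)}

edges = [{"source_node_key": "triage", "target_node_key": "billing",
          "required_variables": ["account_id"]}]
print(classify_feasibility_sketch(edges, seeded_variables=set()))
# {'classification': 'conditionally_feasible', 'missing': ['account_id']}
```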

Run the swarm and capture artifacts

Once you have a scenario_seed, run the swarm and collect both the event stream and persisted checkpoints:

python
import asyncio

from swarmforge.evaluation import evaluate_scenario_artifacts
from swarmforge.evaluation.provider import ModelConfig
from swarmforge.swarm import InMemorySessionStore, SwarmSession, build_turn_runner, process_swarm_stream


async def run_and_score():
    session = SwarmSession(id="eval-session-1", swarm=swarm)
    store = InMemorySessionStore()
    turn_runner = build_turn_runner(ModelConfig())

    events = []
    async for event in process_swarm_stream(
        session,
        scenario_seed["starting_prompt"],
        store=store,
        turn_runner=turn_runner,
    ):
        events.append(event)

    checkpoints = await store.list_checkpoints(session.id)
    score = evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=events,
        checkpoints=checkpoints,
    )
    return score


result = asyncio.run(run_and_score())
print(result["overall_score"])

This is the cleanest way to evaluate the SDK runtime because you score the exact events and checkpoints produced by the real orchestration loop.

Score SDK artifacts

evaluate_scenario_artifacts(...) checks five dimensions:

  • routing
  • variables
  • tools
  • minimum turns
  • agent coverage

If a scenario does not expect tool usage and no tool calls occur, the tools dimension still scores as satisfied instead of dragging down the overall score.
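That "vacuously satisfied" behavior can be sketched in plain Python; the function name and exact scoring below are assumptions for illustration, not the library's implementation:

```python
def score_tools_sketch(expected_tools, actual_tools):
    # No expected tools: the dimension is satisfied rather than
    # penalizing a run that legitimately never needed a tool.
    if not expected_tools:
        return 1.0
    # Otherwise, score the fraction of expected tools actually called.
    hit = len(set(expected_tools) & set(actual_tools))
    return hit / len(set(expected_tools))

print(score_tools_sketch([], []))                            # 1.0
print(score_tools_sketch(["lookup"], []))                    # 0.0
print(score_tools_sketch(["lookup"], ["lookup", "refund"]))  # 1.0
```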

The returned object includes fields such as:

  • passed
  • overall_score
  • routing
  • variables
  • tools
  • actual_routing
  • actual_tools
  • final_globals

If you score several scenarios, combine them with aggregate_scenario_results(...):

python
from swarmforge.evaluation import aggregate_scenario_results

summary = aggregate_scenario_results([result])
print(summary["overall_score"])
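Aggregation of this kind can be sketched in plain Python. The sketch below assumes only the `overall_score` and `passed` fields shown above; it is an illustration of the idea, not aggregate_scenario_results(...)'s implementation:

```python
def aggregate_sketch(results):
    # Average the per-scenario overall scores and report a pass rate.
    scores = [r["overall_score"] for r in results]
    return {
        "overall_score": sum(scores) / len(scores),
        "pass_rate": sum(1 for r in results if r["passed"]) / len(results),
        "count": len(results),
    }

batch = [{"overall_score": 1.0, "passed": True},
         {"overall_score": 0.5, "passed": False}]
print(aggregate_sketch(batch))
# {'overall_score': 0.75, 'pass_rate': 0.5, 'count': 2}
```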


FastAPI evaluation tutorial

Use the FastAPI path when your application runs swarms behind HTTP and you want to score the same runtime artifacts that the API returns.

FastAPI evaluation flow

The FastAPI evaluation flow is:

  1. create or bind a FastAPI app
  2. send a run or message request
  3. collect the returned events and checkpoints
  4. build or reuse the matching graph_snapshot
  5. score the response artifacts with evaluate_scenario_artifacts(...)

The important point is that evaluation still happens against runtime artifacts, not against HTTP-specific data structures.

Evaluate a bound FastAPI swarm

For a bound swarm created with create_swarm_app(...), the message response already contains the artifacts you need.

python
import asyncio

import httpx

from swarmforge.api import create_swarm_app
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    evaluate_scenario_artifacts,
)


# SUPPORT_SWARM is an existing SwarmDefinition, for example the support
# swarm built with build_swarm_definition(...) in the SDK tutorial above.
app = create_swarm_app(SUPPORT_SWARM)
graph_snapshot = build_graph_snapshot(SUPPORT_SWARM)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in build_heuristic_swarm_intents(graph_snapshot, 1)],
    1,
    1,
)[0]


async def evaluate_http_run():
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
        await client.post(
            "/sessions",
            json={"session_id": "support-1", "state": {"account_id": "ACME-991"}},
        )
        response = await client.post(
            "/sessions/support-1/messages",
            json={"user_input": scenario_seed["starting_prompt"]},
        )
        payload = response.json()

    return evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=payload["events"],
        checkpoints=payload["checkpoints"],
    )


score = asyncio.run(evaluate_http_run())
print(score["overall_score"])

That works for both single-agent and multi-agent bound FastAPI apps because the response contract is the same.

Evaluate the generic FastAPI transport

For create_fastapi_app(...), use the /v1/swarm/run or /v1/sessions/.../messages responses exactly the same way:

python
import asyncio

import httpx

from swarmforge.api import create_fastapi_app
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    evaluate_scenario_artifacts,
)
from swarmforge.authoring import build_swarm_definition


swarm = build_swarm_definition(
    {
        "nodes": [
            {"node_key": "assistant", "name": "Assistant", "is_entry_node": True},
        ],
        "edges": [],
        "variables": [],
    },
    swarm_id="single-agent",
    name="Single Agent",
)

graph_snapshot = build_graph_snapshot(swarm)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in build_heuristic_swarm_intents(graph_snapshot, 1)],
    1,
    1,
)[0]

app = create_fastapi_app()


async def evaluate_generic_api():
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
        response = await client.post(
            "/v1/swarm/run",
            json={
                "swarm": {
                    "id": "single-agent",
                    "name": "Single Agent",
                    "nodes": [
                        {
                            "node_key": "assistant",
                            "name": "Assistant",
                            "system_prompt": "You are a helpful assistant.",
                            "is_entry_node": True,
                        }
                    ],
                    "edges": [],
                    "variables": [],
                },
                "user_input": scenario_seed["starting_prompt"],
            },
        )
        payload = response.json()

    return evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=payload["events"],
        checkpoints=payload["checkpoints"],
    )


score = asyncio.run(evaluate_generic_api())
print(score["overall_score"])

This is the right path when the swarm itself comes from a UI builder or another external control plane.


Examples and artifacts

Example walkthrough

If you want one clean evaluation walkthrough from source:

  1. examples/build_support_swarm.py
  2. examples/run_support_swarm.py
  3. examples/evaluate_support_swarm.py
  4. examples/fastapi_swarm.py
  5. examples/fastapi_server.py

Core artifacts

  • graph_snapshot: a normalized, serializable representation of a swarm definition
  • scenario_seed: the prompt, routing expectations, success criteria, and turn requirements for one scenario
  • event_log: runtime events from SDK or FastAPI execution
  • checkpoints: persisted SessionCheckpoint records or serialized checkpoint payloads
  • artifact score: routing, variable, tool, turn-count, and coverage scoring against those artifacts
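Because these artifacts are plain serializable data, they can be captured to disk in one CI step and scored in a later one. A minimal sketch, with the bundle layout assumed for illustration:

```python
import json
import os
import tempfile


def save_artifacts(path, events, checkpoints):
    # Persist a run's artifacts so scoring can happen in a later CI step.
    with open(path, "w") as f:
        json.dump({"events": events, "checkpoints": checkpoints}, f)


def load_artifacts(path):
    with open(path) as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "run-artifacts.json")
save_artifacts(path, events=[{"type": "agent_message"}], checkpoints=[])
print(load_artifacts(path)["events"][0]["type"])  # agent_message
```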

Relationship to transport

Because evaluation scores artifacts instead of transports, the same evaluate_scenario_artifacts(...) call works for:

  • in-process SDK execution
  • typed FastAPI apps created with create_swarm_app(...)
  • generic FastAPI transport created with create_fastapi_app(...)

That is what makes evaluation a good regression layer for both package integrations and HTTP deployments.
