
Evaluation

This guide explains how to evaluate SwarmForge runs by scoring runtime artifacts such as events, checkpoints, routing, tool calls, and final state. It does not cover general testing strategy beyond the SwarmForge evaluation helpers.

This guide is for developers who want to measure runtime behavior in SDK or FastAPI flows. It assumes that you already know how to run a swarm and collect the resulting artifacts.

After reading this guide, you should be able to:

  • build a graph_snapshot
  • create or select scenario seeds
  • score SDK and FastAPI runs with evaluate_scenario_artifacts(...)
  • aggregate results across scenarios

The evaluation package is transport-agnostic. It scores runtime artifacts such as event logs, checkpoints, routing traces, tool calls, and final state, so the same evaluation flow works for:

  • direct SDK runs through process_swarm_stream(...)
  • FastAPI runs through create_swarm_app(...)
  • FastAPI runs through create_fastapi_app(...)

The core implementation lives under src/swarmforge/evaluation/, with graph and scoring helpers in src/swarmforge/evaluation/swarm.py and trace/conversation helpers in src/swarmforge/evaluation/runner/.

Evaluation overview

What evaluation works on

Evaluation does not depend on a specific UI or HTTP layer. It works on artifacts that every runtime path can produce:

  • a graph_snapshot
  • a scenario_seed
  • an event_log
  • a list of SessionCheckpoint records

That makes evaluation useful for:

  • local SDK regression tests
  • FastAPI endpoint tests
  • CI scoring after a run
  • comparing runtime behavior across providers
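For CI scoring, a run's result is typically gated on a threshold. A minimal sketch, assuming only a result dict with an `overall_score` field like the one returned by evaluate_scenario_artifacts(...); the helper name and threshold are illustrative, not part of the library:

```python
def assert_run_quality(result, min_score=0.8):
    # Fail fast in CI when a scored run drops below the threshold.
    if result["overall_score"] < min_score:
        raise AssertionError(
            f"run scored {result['overall_score']:.2f}, below {min_score:.2f}"
        )

assert_run_quality({"overall_score": 0.92})  # passes silently
```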

Core helpers

The main helper surface is:

  • build_graph_snapshot(...)
  • build_heuristic_swarm_intents(...)
  • build_intent_based_swarm_scenario_seeds(...)
  • classify_scenario_feasibility(...)
  • evaluate_scenario_artifacts(...)
  • aggregate_scenario_results(...)

ConversationRunner is available separately when you want model-backed multi-turn trace capture outside the swarm runtime itself.

SDK evaluation tutorial

Use the SDK path when you already run swarms directly in Python and want to score the exact artifacts returned by process_swarm_stream(...).

SDK evaluation flow

The shortest SDK evaluation flow is:

  1. build a SwarmDefinition
  2. convert it into a graph_snapshot
  3. derive scenario seeds
  4. run the swarm through process_swarm_stream(...)
  5. collect the emitted events and checkpoints
  6. score those artifacts

Build evaluation inputs

Start from a runtime swarm definition, then create a snapshot and scenario seed:

python
from swarmforge.authoring import build_swarm_definition
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    classify_scenario_feasibility,
)

swarm = build_swarm_definition(
    {
        "nodes": [
            {
                "node_key": "triage",
                "name": "Triage",
                "persona": "",
                "is_entry_node": True,
            },
            {
                "node_key": "billing",
                "name": "Billing",
                "persona": "",
                "is_entry_node": False,
            },
        ],
        "edges": [
            {
                "source_node_key": "triage",
                "target_node_key": "billing",
                "handoff_description": "Transfer after confirming the request is billing-related.",
                "required_variables": ["account_id"],
            }
        ],
        "variables": [
            {
                "key_name": "account_id",
                "description": "Customer account identifier",
                "reducer_rule": "overwrite",
            }
        ],
    },
    swarm_id="support",
    name="Support Swarm",
)

graph_snapshot = build_graph_snapshot(swarm)
intents = build_heuristic_swarm_intents(graph_snapshot, 2)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in intents],
    1,
    2,
)[0]

feasibility = classify_scenario_feasibility(graph_snapshot, scenario_seed)
print(feasibility["classification"])

For this swarm, feasibility is usually conditionally_feasible because the triage -> billing handoff depends on account_id.
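A rule of this shape can be illustrated in plain Python. The sketch below is an assumption for illustration, not classify_scenario_feasibility(...)'s actual implementation: it treats an edge whose required variables are not seeded up front as making the scenario conditionally feasible.

```python
def classify_feasibility_sketch(edges, seeded_variables):
    # Collect every required variable that the scenario does not supply
    # up front; the run must first gather those before the handoff fires.
    missing = set()
    for edge in edges:
        for var in edge.get("required_variables", []):
            if var not in seeded_variables:
                missing.add(var)
    if not missing:
        return {"classification": "feasible", "missing": []}
    return {"classification": "conditionally_feasible", "missing": sorted(missing)}

edges = [{"source_node_key": "triage", "target_node_key": "billing",
          "required_variables": ["account_id"]}]
print(classify_feasibility_sketch(edges, seeded_variables=set()))
# {'classification': 'conditionally_feasible', 'missing': ['account_id']}
```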

Run the swarm and capture artifacts

Once you have a scenario_seed, run the swarm and collect both the event stream and persisted checkpoints:

python
import asyncio

from swarmforge.evaluation import evaluate_scenario_artifacts
from swarmforge.evaluation.provider import ModelConfig
from swarmforge.swarm import InMemorySessionStore, SwarmSession, build_turn_runner, process_swarm_stream


async def run_and_score():
    session = SwarmSession(id="eval-session-1", swarm=swarm)
    store = InMemorySessionStore()
    turn_runner = build_turn_runner(ModelConfig())

    events = []
    async for event in process_swarm_stream(
        session,
        scenario_seed["starting_prompt"],
        store=store,
        turn_runner=turn_runner,
    ):
        events.append(event)

    checkpoints = await store.list_checkpoints(session.id)
    score = evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=events,
        checkpoints=checkpoints,
    )
    return score


result = asyncio.run(run_and_score())
print(result["overall_score"])

This is the cleanest way to evaluate the SDK runtime because you score the exact events and checkpoints produced by the real orchestration loop.

Score SDK artifacts

evaluate_scenario_artifacts(...) checks five dimensions:

  • routing
  • variables
  • tools
  • minimum turns
  • agent coverage

If a scenario does not expect tool usage and no tool calls occur, the tools dimension still scores as satisfied instead of dragging down the overall score.
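That "vacuously satisfied" behavior can be sketched in plain Python; the function name and exact scoring below are assumptions for illustration, not the library's implementation:

```python
def score_tools_sketch(expected_tools, actual_tools):
    # No expected tools: the dimension is satisfied rather than
    # penalizing a run that legitimately never needed a tool.
    if not expected_tools:
        return 1.0
    # Otherwise, score the fraction of expected tools actually called.
    hit = len(set(expected_tools) & set(actual_tools))
    return hit / len(set(expected_tools))

print(score_tools_sketch([], []))                            # 1.0
print(score_tools_sketch(["lookup"], []))                    # 0.0
print(score_tools_sketch(["lookup"], ["lookup", "refund"]))  # 1.0
```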

The returned object includes fields such as:

  • passed
  • overall_score
  • routing
  • variables
  • tools
  • actual_routing
  • actual_tools
  • final_globals

If you score several scenarios, combine them with aggregate_scenario_results(...):

python
from swarmforge.evaluation import aggregate_scenario_results

summary = aggregate_scenario_results([result])
print(summary["overall_score"])
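Aggregation of this kind can be sketched in plain Python. The sketch below assumes only the `overall_score` and `passed` fields shown above; it is an illustration of the idea, not aggregate_scenario_results(...)'s implementation:

```python
def aggregate_sketch(results):
    # Average the per-scenario overall scores and report a pass rate.
    scores = [r["overall_score"] for r in results]
    return {
        "overall_score": sum(scores) / len(scores),
        "pass_rate": sum(1 for r in results if r["passed"]) / len(results),
        "count": len(results),
    }

batch = [{"overall_score": 1.0, "passed": True},
         {"overall_score": 0.5, "passed": False}]
print(aggregate_sketch(batch))
# {'overall_score': 0.75, 'pass_rate': 0.5, 'count': 2}
```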


FastAPI evaluation tutorial

Use the FastAPI path when your application runs swarms behind HTTP and you want to score the same runtime artifacts that the API returns.

FastAPI evaluation flow

The FastAPI evaluation flow is:

  1. create or bind a FastAPI app
  2. send a run or message request
  3. collect the returned events and checkpoints
  4. build or reuse the matching graph_snapshot
  5. score the response artifacts with evaluate_scenario_artifacts(...)

The important point is that evaluation still happens against runtime artifacts, not against HTTP-specific data structures.

Evaluate a bound FastAPI swarm

For a bound swarm created with create_swarm_app(...), the message response already contains the artifacts you need.

python
import asyncio

import httpx

from swarmforge.api import create_swarm_app
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    evaluate_scenario_artifacts,
)


# SUPPORT_SWARM is an existing SwarmDefinition, for example the support
# swarm built with build_swarm_definition(...) in the SDK tutorial above.
app = create_swarm_app(SUPPORT_SWARM)
graph_snapshot = build_graph_snapshot(SUPPORT_SWARM)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in build_heuristic_swarm_intents(graph_snapshot, 1)],
    1,
    1,
)[0]


async def evaluate_http_run():
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
        await client.post(
            "/sessions",
            json={"session_id": "support-1", "state": {"account_id": "ACME-991"}},
        )
        response = await client.post(
            "/sessions/support-1/messages",
            json={"user_input": scenario_seed["starting_prompt"]},
        )
        payload = response.json()

    return evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=payload["events"],
        checkpoints=payload["checkpoints"],
    )


score = asyncio.run(evaluate_http_run())
print(score["overall_score"])

That works for both single-agent and multi-agent bound FastAPI apps because the response contract is the same.

Evaluate the generic FastAPI transport

For create_fastapi_app(...), use the /v1/swarm/run or /v1/sessions/.../messages responses exactly the same way:

python
import asyncio

import httpx

from swarmforge.api import create_fastapi_app
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    evaluate_scenario_artifacts,
)
from swarmforge.authoring import build_swarm_definition


swarm = build_swarm_definition(
    {
        "nodes": [
            {"node_key": "assistant", "name": "Assistant", "is_entry_node": True},
        ],
        "edges": [],
        "variables": [],
    },
    swarm_id="single-agent",
    name="Single Agent",
)

graph_snapshot = build_graph_snapshot(swarm)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in build_heuristic_swarm_intents(graph_snapshot, 1)],
    1,
    1,
)[0]

app = create_fastapi_app()


async def evaluate_generic_api():
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
        response = await client.post(
            "/v1/swarm/run",
            json={
                "swarm": {
                    "id": "single-agent",
                    "name": "Single Agent",
                    "nodes": [
                        {
                            "node_key": "assistant",
                            "name": "Assistant",
                            "system_prompt": "You are a helpful assistant.",
                            "is_entry_node": True,
                        }
                    ],
                    "edges": [],
                    "variables": [],
                },
                "user_input": scenario_seed["starting_prompt"],
            },
        )
        payload = response.json()

    return evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=payload["events"],
        checkpoints=payload["checkpoints"],
    )


score = asyncio.run(evaluate_generic_api())
print(score["overall_score"])

This is the right path when the swarm itself comes from a UI builder or another external control plane.


Examples and artifacts

Example walkthrough

If you want one clean evaluation walkthrough from source:

  1. examples/build_support_swarm.py
  2. examples/run_support_swarm.py
  3. examples/evaluate_support_swarm.py
  4. examples/fastapi_swarm.py
  5. examples/fastapi_server.py

Core artifacts

  • graph_snapshot: a normalized, serializable representation of a swarm definition
  • scenario_seed: the prompt, routing expectations, success criteria, and turn requirements for one scenario
  • event_log: runtime events from SDK or FastAPI execution
  • checkpoints: persisted SessionCheckpoint records or serialized checkpoint payloads
  • artifact score: routing, variable, tool, turn-count, and coverage scoring against those artifacts
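Because these artifacts are plain serializable data, they can be captured to disk in one CI step and scored in a later one. A minimal sketch, with the bundle layout assumed for illustration:

```python
import json
import os
import tempfile


def save_artifacts(path, events, checkpoints):
    # Persist a run's artifacts so scoring can happen in a later CI step.
    with open(path, "w") as f:
        json.dump({"events": events, "checkpoints": checkpoints}, f)


def load_artifacts(path):
    with open(path) as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "run-artifacts.json")
save_artifacts(path, events=[{"type": "agent_message"}], checkpoints=[])
print(load_artifacts(path)["events"][0]["type"])  # agent_message
```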

Relationship to transport

Because evaluation scores artifacts instead of transports, the same evaluate_scenario_artifacts(...) call works for:

  • in-process SDK execution
  • typed FastAPI apps created with create_swarm_app(...)
  • generic FastAPI transport created with create_fastapi_app(...)

That is what makes evaluation a good regression layer for both package integrations and HTTP deployments.
