Evaluation
This guide explains how to evaluate SwarmForge runs by scoring runtime artifacts such as events, checkpoints, routing, tool calls, and final state. It does not explain general test strategy outside the SwarmForge evaluation helpers.
This guide is for developers who want to measure runtime behavior in SDK or FastAPI flows. It assumes that you already know how to run a swarm and collect the resulting artifacts.
After reading this guide, you should be able to:
- build a `graph_snapshot`
- create or select scenario seeds
- score SDK and FastAPI runs with `evaluate_scenario_artifacts(...)`
- aggregate results across scenarios
The evaluation package is transport-agnostic. It scores runtime artifacts such as event logs, checkpoints, routing traces, tool calls, and final state, so the same evaluation flow works for:
- direct SDK runs through `process_swarm_stream(...)`
- FastAPI runs through `create_swarm_app(...)`
- FastAPI runs through `create_fastapi_app(...)`
The core implementation lives under src/swarmforge/evaluation/, with graph and scoring helpers in src/swarmforge/evaluation/swarm.py and trace/conversation helpers in src/swarmforge/evaluation/runner/.
Evaluation overview
What evaluation works on
Evaluation does not depend on a specific UI or HTTP layer. It works on artifacts that every runtime path can produce:
- a `graph_snapshot`
- a `scenario_seed`
- an `event_log`
- a list of `SessionCheckpoint` records
That makes evaluation useful for:
- local SDK regression tests
- FastAPI endpoint tests
- CI scoring after a run
- comparing runtime behavior across providers
Core helpers
The main helper surface is:
- `build_graph_snapshot(...)`
- `build_heuristic_swarm_intents(...)`
- `build_intent_based_swarm_scenario_seeds(...)`
- `classify_scenario_feasibility(...)`
- `evaluate_scenario_artifacts(...)`
- `aggregate_scenario_results(...)`
ConversationRunner is available separately when you want model-backed multi-turn trace capture outside the swarm runtime itself.
SDK evaluation tutorial
Use the SDK path when you already run swarms directly in Python and want to score the exact artifacts returned by process_swarm_stream(...).
SDK evaluation flow
The shortest SDK evaluation flow is:
- build a `SwarmDefinition`
- convert it into a `graph_snapshot`
- derive scenario seeds
- run the swarm through `process_swarm_stream(...)`
- collect the emitted events and checkpoints
- score those artifacts
Build evaluation inputs
Start from a runtime swarm definition, then create a snapshot and scenario seed:
```python
from swarmforge.authoring import build_swarm_definition
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    classify_scenario_feasibility,
)

swarm = build_swarm_definition(
    {
        "nodes": [
            {
                "node_key": "triage",
                "name": "Triage",
                "persona": "",
                "is_entry_node": True,
            },
            {
                "node_key": "billing",
                "name": "Billing",
                "persona": "",
                "is_entry_node": False,
            },
        ],
        "edges": [
            {
                "source_node_key": "triage",
                "target_node_key": "billing",
                "handoff_description": "Transfer after confirming the request is billing-related.",
                "required_variables": ["account_id"],
            }
        ],
        "variables": [
            {
                "key_name": "account_id",
                "description": "Customer account identifier",
                "reducer_rule": "overwrite",
            }
        ],
    },
    swarm_id="support",
    name="Support Swarm",
)

graph_snapshot = build_graph_snapshot(swarm)
intents = build_heuristic_swarm_intents(graph_snapshot, 2)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in intents],
    1,
    2,
)[0]

feasibility = classify_scenario_feasibility(graph_snapshot, scenario_seed)
print(feasibility["classification"])
```

For this swarm, feasibility is usually `conditionally_feasible` because the `triage -> billing` handoff depends on `account_id`.
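The idea behind that classification can be pictured with plain data: a scenario whose expected route crosses an edge with unmet `required_variables` is only conditionally feasible. The sketch below is illustrative logic over plain dicts, not SwarmForge's implementation; `classify_route` is a hypothetical helper, and only the edge keys from the swarm definition above are assumed.

```python
# Illustrative sketch of feasibility classification over plain dicts.
# classify_route is a hypothetical stand-in, NOT swarmforge's implementation;
# it only mirrors the idea that a route through an edge with unmet
# required_variables is conditionally feasible.

def classify_route(edges, route, known_variables):
    """Classify a route as feasible / conditionally_feasible / infeasible."""
    by_pair = {(e["source_node_key"], e["target_node_key"]): e for e in edges}
    missing = []
    for src, dst in zip(route, route[1:]):
        edge = by_pair.get((src, dst))
        if edge is None:
            return "infeasible", []  # no such handoff exists in the graph
        for var in edge.get("required_variables", []):
            if var not in known_variables:
                missing.append(var)
    return ("conditionally_feasible", missing) if missing else ("feasible", [])

edges = [
    {
        "source_node_key": "triage",
        "target_node_key": "billing",
        "required_variables": ["account_id"],
    }
]

# account_id is unknown up front, so the handoff is conditional on it.
print(classify_route(edges, ["triage", "billing"], set()))
# With account_id pre-seeded in state, the same route is plainly feasible.
print(classify_route(edges, ["triage", "billing"], {"account_id"}))
```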
Run the swarm and capture artifacts
Once you have a scenario_seed, run the swarm and collect both the event stream and persisted checkpoints:
```python
import asyncio

from swarmforge.evaluation import evaluate_scenario_artifacts
from swarmforge.evaluation.provider import ModelConfig
from swarmforge.swarm import (
    InMemorySessionStore,
    SwarmSession,
    build_turn_runner,
    process_swarm_stream,
)

async def run_and_score():
    session = SwarmSession(id="eval-session-1", swarm=swarm)
    store = InMemorySessionStore()
    turn_runner = build_turn_runner(ModelConfig())

    events = []
    async for event in process_swarm_stream(
        session,
        scenario_seed["starting_prompt"],
        store=store,
        turn_runner=turn_runner,
    ):
        events.append(event)

    checkpoints = await store.list_checkpoints(session.id)
    score = evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=events,
        checkpoints=checkpoints,
    )
    return score

result = asyncio.run(run_and_score())
print(result["overall_score"])
```

This is the cleanest way to evaluate the SDK runtime because you score the exact events and checkpoints produced by the real orchestration loop.
Score SDK artifacts
`evaluate_scenario_artifacts(...)` checks five dimensions:
- routing
- variables
- tools
- minimum turns
- agent coverage
If a scenario does not expect tool usage and no tool calls occur, the tools dimension still scores as satisfied instead of dragging down the overall score.
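That neutral-tools rule can be sketched in isolation. The function below is an illustrative stand-in for the tools dimension only, not the library's scorer, assuming expected and actual tool calls are plain lists of tool names:

```python
# Illustrative stand-in for the tools dimension described above; NOT the
# library's scorer. A scenario that expects no tools scores 1.0 even when
# the run makes no tool calls, instead of penalizing the overall score.

def score_tools(expected_tools, actual_tools):
    if not expected_tools:
        return 1.0  # nothing expected: the dimension is satisfied by default
    hits = sum(1 for tool in expected_tools if tool in actual_tools)
    return hits / len(expected_tools)

print(score_tools([], []))                  # no tools expected, none used: 1.0
print(score_tools(["lookup_invoice"], []))  # expected tool never called: 0.0
```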
The returned object includes fields such as:
- `passed`
- `overall_score`
- `routing`
- `variables`
- `tools`
- `actual_routing`
- `actual_tools`
- `final_globals`
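A quick way to sanity-check a result is to read those fields directly. The dict below is illustrative only, using the field names above with made-up values rather than output from a real run:

```python
# Illustrative result shape built from the field names listed above;
# the values are made up, not produced by a real run.
result = {
    "passed": True,
    "overall_score": 0.9,
    "routing": 1.0,
    "variables": 1.0,
    "tools": 1.0,
    "actual_routing": ["triage", "billing"],
    "actual_tools": [],
    "final_globals": {"account_id": "ACME-991"},
}

if result["passed"]:
    print(f"score={result['overall_score']}, route={result['actual_routing']}")
```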
If you score several scenarios, combine them with `aggregate_scenario_results(...)`:

```python
from swarmforge.evaluation import aggregate_scenario_results

summary = aggregate_scenario_results([result])
print(summary["overall_score"])
```

SDK examples
Useful SDK-oriented evaluation entry points:
FastAPI evaluation tutorial
Use the FastAPI path when your application runs swarms behind HTTP and you want to score the same runtime artifacts that the API returns.
FastAPI evaluation flow
The FastAPI evaluation flow is:
- create or bind a FastAPI app
- send a run or message request
- collect the returned `events` and `checkpoints`
- build or reuse the matching `graph_snapshot`
- score the response artifacts with `evaluate_scenario_artifacts(...)`
The important point is that evaluation still happens against runtime artifacts, not against HTTP-specific data structures.
Evaluate a bound FastAPI swarm
For a bound swarm created with create_swarm_app(...), the message response already contains the artifacts you need.
```python
import asyncio

import httpx

from swarmforge.api import create_swarm_app
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    evaluate_scenario_artifacts,
)

app = create_swarm_app(SUPPORT_SWARM)
graph_snapshot = build_graph_snapshot(SUPPORT_SWARM)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in build_heuristic_swarm_intents(graph_snapshot, 1)],
    1,
    1,
)[0]

async def evaluate_http_run():
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
        await client.post(
            "/sessions",
            json={"session_id": "support-1", "state": {"account_id": "ACME-991"}},
        )
        response = await client.post(
            "/sessions/support-1/messages",
            json={"user_input": scenario_seed["starting_prompt"]},
        )

    payload = response.json()
    return evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=payload["events"],
        checkpoints=payload["checkpoints"],
    )

score = asyncio.run(evaluate_http_run())
print(score["overall_score"])
```

That works for both single-agent and multi-agent bound FastAPI apps because the response contract is the same.
Evaluate the generic FastAPI transport
For create_fastapi_app(...), use the /v1/swarm/run or /v1/sessions/.../messages responses exactly the same way:
```python
import asyncio

import httpx

from swarmforge.api import create_fastapi_app
from swarmforge.authoring import build_swarm_definition
from swarmforge.evaluation import (
    build_graph_snapshot,
    build_heuristic_swarm_intents,
    build_intent_based_swarm_scenario_seeds,
    evaluate_scenario_artifacts,
)

swarm = build_swarm_definition(
    {
        "nodes": [
            {"node_key": "assistant", "name": "Assistant", "is_entry_node": True},
        ],
        "edges": [],
        "variables": [],
    },
    swarm_id="single-agent",
    name="Single Agent",
)

graph_snapshot = build_graph_snapshot(swarm)
scenario_seed = build_intent_based_swarm_scenario_seeds(
    graph_snapshot,
    [intent["title"] for intent in build_heuristic_swarm_intents(graph_snapshot, 1)],
    1,
    1,
)[0]

app = create_fastapi_app()

async def evaluate_generic_api():
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
        response = await client.post(
            "/v1/swarm/run",
            json={
                "swarm": {
                    "id": "single-agent",
                    "name": "Single Agent",
                    "nodes": [
                        {
                            "node_key": "assistant",
                            "name": "Assistant",
                            "system_prompt": "You are a helpful assistant.",
                            "is_entry_node": True,
                        }
                    ],
                    "edges": [],
                    "variables": [],
                },
                "user_input": scenario_seed["starting_prompt"],
            },
        )

    payload = response.json()
    return evaluate_scenario_artifacts(
        graph_snapshot,
        scenario_seed,
        event_log=payload["events"],
        checkpoints=payload["checkpoints"],
    )

score = asyncio.run(evaluate_generic_api())
print(score["overall_score"])
```

This is the right path when the swarm itself comes from a UI builder or another external control plane.
FastAPI examples
Useful FastAPI-oriented evaluation references:
- docs/api for the current response shapes and route contracts
- examples/fastapi_swarm.py
- examples/fastapi_server.py
- examples/fastapi_tools_swarm.py
Examples and artifacts
Example walkthrough
If you want one clean evaluation walkthrough from source:
- examples/build_support_swarm.py
- examples/run_support_swarm.py
- examples/evaluate_support_swarm.py
- examples/fastapi_swarm.py
- examples/fastapi_server.py
Core artifacts
- `graph_snapshot`: normalized, serializable representation of a swarm definition
- `scenario_seed`: prompt, routing expectations, success criteria, and turn requirements for one scenario
- `event_log`: runtime events from SDK or FastAPI execution
- `checkpoints`: persisted `SessionCheckpoint` records or serialized checkpoint payloads
- artifact score: routing, variable, tool, turn-count, and coverage scoring against those artifacts
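All of these artifacts are plain, serializable data, which is what keeps evaluation transport-agnostic. The shapes below are illustrative only, assembled from keys used elsewhere in this guide; they are not the library's exact schemas, and the event fields are assumptions for illustration:

```python
import json

# Illustrative artifact shapes; NOT swarmforge's exact schemas. Only the
# scenario_seed key shown elsewhere in this guide is real; the event and
# checkpoint fields are made up for illustration.
scenario_seed = {"starting_prompt": "I was double-charged on my last invoice."}
event_log = [
    {"type": "agent_message", "node_key": "triage"},
    {"type": "handoff", "source": "triage", "target": "billing"},
]
checkpoints = [{"session_id": "eval-session-1", "turn": 1}]

# Everything round-trips through JSON, so the same artifacts can come from
# an in-process run or straight out of an HTTP response body.
payload = json.dumps({"events": event_log, "checkpoints": checkpoints})
print(len(json.loads(payload)["events"]))
```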
Relationship to transport
Because evaluation scores artifacts instead of transports, the same evaluate_scenario_artifacts(...) call works for:
- in-process SDK execution
- typed FastAPI apps created with `create_swarm_app(...)`
- generic FastAPI transport created with `create_fastapi_app(...)`
That is what makes evaluation a good regression layer for both package integrations and HTTP deployments.
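As a regression layer, the per-scenario results can gate CI regardless of which transport produced the artifacts. The gate below is a plain-dict sketch of what you might build on top of the scoring helpers, not part of the library; it assumes each result carries the `passed` and `overall_score` fields shown earlier:

```python
# Minimal CI-gate sketch over per-scenario results; a hypothetical layer on
# top of the evaluation helpers, NOT part of the library.

def regression_gate(results, min_overall=0.8):
    """Fail the gate if any scenario failed or the mean score is too low."""
    if not results:
        return False
    mean_score = sum(r["overall_score"] for r in results) / len(results)
    return all(r["passed"] for r in results) and mean_score >= min_overall

results = [
    {"passed": True, "overall_score": 0.95},
    {"passed": True, "overall_score": 0.85},
]
print(regression_gate(results))  # both passed and the mean 0.90 clears 0.8
```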