Skip to content

Execution Graphs

Relayna's execution graph feature reconstructs what actually happened for a task at runtime. It is a read model built from durable state that already exists across Relayna's runtime packages:

  • Redis status history from RedisStatusStore
  • persisted task-linked observations from RedisObservationStore
  • DLQ records from DLQService and RedisDLQStore
  • parent-child aggregation lineage indexed from status meta.parent_task_id

This is intentionally different from the static workflow topology graph:

  • topology graph answers "what stages can connect"
  • status history answers "what happened in order"
  • execution graph answers "what caused what"

Public surfaces

Relayna exposes the execution-graph feature through:

  • relayna.observability.ExecutionGraph
  • relayna.observability.ExecutionGraphNode
  • relayna.observability.ExecutionGraphEdge
  • relayna.observability.ExecutionGraphSummary
  • relayna.observability.ExecutionGraphService
  • relayna.observability.build_execution_graph(...)
  • relayna.observability.execution_graph_mermaid(...)
  • relayna.api.create_execution_router(...)
  • relayna_studio.build_execution_view(...) from the separate relayna-studio deployment package

FastAPI route

Register the route from the FastAPI runtime:

from fastapi import FastAPI
from redis.asyncio import Redis

from relayna.api import create_execution_router, create_relayna_lifespan, get_relayna_runtime
from relayna.consumer import RetryPolicy, TaskConsumer
from relayna.observability import RedisObservationStore, make_redis_observation_sink

observation_store_prefix = "relayna-observations"

worker_observation_store = RedisObservationStore(
    Redis.from_url("redis://localhost:6379/0"),
    prefix=observation_store_prefix,
    ttl_seconds=86400,
    history_maxlen=500,
)

consumer = TaskConsumer(
    rabbitmq=client,
    handler=handle_task,
    retry_policy=RetryPolicy(max_retries=3, delay_ms=30000),
    observation_sink=make_redis_observation_sink(worker_observation_store),
)

app = FastAPI(
    lifespan=create_relayna_lifespan(
        topology=topology,
        redis_url="redis://localhost:6379/0",
        observation_store_prefix=observation_store_prefix,
        observation_store_ttl_seconds=86400,
        observation_history_maxlen=500,
    )
)

runtime = get_relayna_runtime(app)
app.include_router(create_execution_router(execution_graph_service=runtime.execution_graph_service))

The default route is:

  • GET /executions/{task_id}/graph

If you use ContractAliasConfig, the path parameter name follows http_aliases["task_id"] and the JSON body field follows field_aliases["task_id"].

Route behavior:

  • returns 404 only when Relayna has no status history, no persisted observations, and no DLQ records for the requested task
  • returns a graph even without persisted observations when status or DLQ data exists
  • sets summary.graph_completeness to "full" when observations were available
  • sets summary.graph_completeness to "partial" when the graph had to be reconstructed from status and DLQ data only

Runtime wiring

Execution graphs are most useful when both HTTP and workers share the same Redis-backed observation store.

FastAPI runtime configuration on create_relayna_lifespan(...):

  • observation_store_prefix
  • observation_store_ttl_seconds
  • observation_history_maxlen

The runtime then exposes:

  • runtime.observation_store
  • runtime.execution_graph_service

Worker-side persistence uses:

  • RedisObservationStore
  • make_redis_observation_sink(...)

Important operational rule:

  • use the same Redis instance and the same observation-store prefix in FastAPI and every worker process that should contribute to the graph

If workers do not persist observations, the route still works, but it can only reconstruct a partial graph.

Data model

Every topology returns the same outer response shape:

{
  "task_id": "task-123",
  "topology_kind": "shared_tasks_shared_status",
  "summary": {
    "status": "failed",
    "started_at": "2026-04-06T10:00:00+00:00",
    "ended_at": "2026-04-06T10:00:03+00:00",
    "duration_ms": 3000,
    "graph_completeness": "full"
  },
  "nodes": [],
  "edges": [],
  "annotations": {},
  "related_task_ids": []
}

Summary fields:

  • status: latest root-task status when one exists
  • started_at: earliest timestamp seen across graph nodes and edges
  • ended_at: latest root-task status timestamp when present, otherwise the latest graph timestamp
  • duration_ms: ended_at - started_at in milliseconds when both exist
  • graph_completeness: "full" or "partial"

Node kinds

Relayna currently emits these standard node kinds:

  • task The requested root task.
  • task_attempt One queue-consumption attempt for a task or aggregation child.
  • workflow_message A workflow transport message identified by message_id.
  • stage_attempt One stage execution attempt in a workflow topology.
  • status_event A persisted status history event.
  • retry A scheduled retry edge point between attempts.
  • dlq_record A dead-letter result from observation history or persisted DLQ detail.
  • aggregation_child A child task linked back to the requested parent task.

Each node carries:

  • id
  • kind
  • task_id
  • label
  • timestamp
  • annotations

Common annotations include:

  • queue_name
  • source_queue_name
  • consumer_name
  • correlation_id
  • retry_attempt
  • task_type
  • stage
  • message_id
  • origin_stage
  • reason
  • max_retries
  • meta

Edge kinds

Relayna currently emits these standard edge kinds:

  • received_by Root task to task attempt, retry to next attempt, or root task to workflow message when no origin stage exists yet.
  • published_status Attempt or stage attempt to status event.
  • retried_as Attempt to retry node.
  • dead_lettered_to Attempt to DLQ node.
  • manual_retry_to Attempt to later attempt when a manual_retrying status indicates handoff.
  • entered_stage Workflow message to stage attempt.
  • stage_transitioned_to Stage attempt to downstream workflow message.
  • aggregated_into Child task to parent task.

Each edge carries:

  • source
  • target
  • kind
  • timestamp
  • annotations

Topology-specific behavior

Shared tasks

For SharedTasksSharedStatusTopology, the execution graph is a task-attempt graph. Relayna does not invent workflow semantics for the shared-task topology, even though the topology shim has a compatibility "default" stage view in other places.

Typical shape:

flowchart LR
    T["task-123\ntask"] -->|received_by| A1["task-123 attempt 1\ntask_attempt"]
    A1 -->|published_status| S1["processing\nstatus_event"]
    A1 -->|retried_as| R1["retry 1\nretry"]
    R1 -->|received_by| A2["task-123 attempt 2\ntask_attempt"]
    A2 -->|dead_lettered_to| D1["handler_error\ndlq_record"]

Routed tasks

For RoutedTasksSharedStatusTopology, the graph is still attempt-oriented, but attempt nodes carry routed task_type metadata in their annotations. Manual handoff through TaskContext.manual_retry(...) is represented with manual_retry_to edges when the status history includes the manual_retrying transition.

Sharded aggregation

For sharded aggregation topologies, Relayna starts with the requested task id and then asks RedisStatusStore for child task ids indexed from status meta.parent_task_id.

That adds:

  • aggregation_child nodes for child tasks
  • aggregated_into edges from child task to parent task
  • child attempts, retries, DLQ nodes, and statuses for every discovered child

This lets one graph show both the parent task timeline and the work delegated to shard-owned child tasks.

Workflow

For SharedStatusWorkflowTopology, the graph becomes stage-aware:

  • workflow_message nodes come from WorkflowMessagePublished and WorkflowMessageReceived
  • stage_attempt nodes come from WorkflowMessageReceived
  • stage_transitioned_to edges connect a stage attempt to its downstream published workflow message
  • entered_stage edges connect a workflow message to the stage attempt that consumed it
  • status events attach to the most recent stage attempt at or before the status timestamp

Typical shape:

flowchart LR
    T["task-123\ntask"] -->|received_by| M1["planner\nworkflow_message"]
    M1 -->|entered_stage| P1["planner attempt 1\nstage_attempt"]
    P1 -->|published_status| S1["planning\nstatus_event"]
    P1 -->|stage_transitioned_to| M2["writer\nworkflow_message"]
    M2 -->|entered_stage| W1["writer attempt 1\nstage_attempt"]
    W1 -->|published_status| S2["completed\nstatus_event"]

Event sources used by the graph

Execution-graph reconstruction is a projection over existing durable artifacts.

Shared-task and routed-task graphs primarily use:

  • TaskMessageReceived
  • TaskLifecycleStatusPublished
  • ConsumerRetryScheduled
  • ConsumerDeadLetterPublished
  • status history from RedisStatusStore
  • optional persisted DLQ records from DLQService

Sharded aggregation graphs add:

  • AggregationMessageReceived
  • AggregationMessageAcked
  • AggregationHandlerFailed
  • AggregationRetryScheduled
  • AggregationDeadLetterPublished
  • child task discovery via RedisStatusStore.get_child_task_ids(...)

Workflow graphs primarily use:

  • WorkflowMessageReceived
  • WorkflowMessagePublished
  • WorkflowStageAcked
  • WorkflowStageFailed
  • shared status history for user-visible status events

Minimal vs full graphs

When observations are missing, Relayna still returns a useful graph:

  • root task node
  • one status_event node per stored status entry
  • DLQ nodes when persisted DLQ detail exists

That fallback graph is marked:

  • summary.graph_completeness = "partial"

When observation history exists, Relayna can reconstruct attempts, stage hops, retry edges, and richer annotations:

  • summary.graph_completeness = "full"

Response example

Example shared-task response:

{
  "task_id": "task-123",
  "topology_kind": "shared_tasks_shared_status",
  "summary": {
    "status": "failed",
    "started_at": "2026-04-06T10:00:00+00:00",
    "ended_at": "2026-04-06T10:00:03+00:00",
    "duration_ms": 3000,
    "graph_completeness": "full"
  },
  "nodes": [
    {
      "id": "task:task-123",
      "kind": "task",
      "task_id": "task-123",
      "label": "task-123",
      "timestamp": null,
      "annotations": {}
    },
    {
      "id": "task-attempt:task-123:1",
      "kind": "task_attempt",
      "task_id": "task-123",
      "label": "task-123 attempt 1",
      "timestamp": "2026-04-06T10:00:00+00:00",
      "annotations": {
        "queue_name": "tasks.queue",
        "retry_attempt": 0
      }
    }
  ],
  "edges": [
    {
      "source": "task:task-123",
      "target": "task-attempt:task-123:1",
      "kind": "received_by",
      "timestamp": "2026-04-06T10:00:00+00:00",
      "annotations": {}
    }
  ],
  "annotations": {},
  "related_task_ids": []
}

Mermaid export

Use execution_graph_mermaid(...) to get a Mermaid flowchart string from any ExecutionGraph instance:

from relayna.observability import execution_graph_mermaid

mermaid = execution_graph_mermaid(graph)
print(mermaid)

This is intended for:

  • docs
  • debugging
  • support tickets
  • copying a graph into Markdown or internal runbooks

The exporter emits a left-to-right Mermaid flowchart LR and labels every node with its user-visible label, node kind, and timestamp when present.

Studio and React Flow

The separate relayna-studio deployment package also exposes a Studio presenter helper:

from relayna_studio import build_execution_view

payload = build_execution_view(graph)

The Studio execution view contains:

  • task_id
  • topology_kind
  • summary
  • related_task_ids
  • graph
  • mermaid

The bundled Studio frontend in apps/studio/ uses the graph payload for React Flow rendering and the Mermaid string for docs/debug display. That means the same backend graph can drive:

  • operator-facing interactive graph exploration in the app
  • copy-paste Mermaid diagrams in documentation or incident notes

Practical guidance

  • Persist observations in every worker that should contribute to graph fidelity.
  • Keep the FastAPI runtime and workers on the same observation-store prefix.
  • Treat graph_completeness="partial" as a signal that observation wiring is missing or retention has already expired.
  • Use the execution graph as a runtime diagnostic view. Keep using the static workflow topology graph when you need to describe the intended workflow design.