AKS observability stack¶
Relayna can run without Studio, Loki, Alloy, Prometheus, or Tempo. A complete Relayna Studio deployment on AKS needs those components when operators want centralized logs, Kubernetes metrics, aggregate Relayna runtime metrics, exact task resource samples, and trace correlation in one place.
Required components¶
| Component | Required for | Notes |
|---|---|---|
| Registered service API pods | Relayna capabilities, status/history, event feed, execution graph, optional /metrics | Include create_metrics_router(runtime.metrics) when the API should expose runtime metrics. |
| Registered service worker pods | Task execution, lifecycle observations, worker-only /metrics | Use start_metrics_http_server(runtime.metrics, port=8001) for worker-only processes. |
| Redis | Relayna status, DLQ indexes, observation history, Studio registry/events | API pods, worker pods, and Studio must share the same logical Redis data plane for complete task detail. |
| RabbitMQ | Relayna task queues, status fanout, retry/DLQ flows | Workers publish lifecycle status and observations around RabbitMQ task handling. |
| Loki | Studio log panels | Studio queries Loki from the backend. The browser does not connect to Loki directly. |
| Alloy | Kubernetes pod log collection, Loki forwarding, and optional OTLP trace receiving | Runs as a DaemonSet for logs. It can also run an OpenTelemetry collector pipeline that forwards spans to Tempo. |
| Prometheus | Studio Kubernetes metrics and Relayna runtime charts | Scrapes cAdvisor, kube-state-metrics, Relayna API /metrics, worker /metrics, and Studio backend /metrics. |
| kube-state-metrics | Pod phase, readiness, resource requests/limits, restart/OOM metrics | Prometheus needs this for the Phase 2 Kubernetes metric groups. |
| Tempo | Studio trace panels | Stores OpenTelemetry spans. Studio queries Tempo by trace_id through the backend. |
| Studio backend | Service registry, federation, log query proxy, metrics query proxy, trace query proxy, /metrics | Configure egress allowlists for AKS service DNS and observability services. |
| Studio frontend | Operator UI | Talks only to the Studio backend. |
Prometheus labels must stay low-cardinality. Do not use task_id,
correlation_id, request_id, worker_id, pod, pod_name, container, or
message_id as Relayna runtime metric labels. Exact per-task CPU/RSS samples
are Relayna observations stored with task lifecycle data, not Prometheus series.
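The label rule above can be enforced with a small check in tests or at startup. The following is a minimal sketch, not part of the Relayna API: the function names and the example metric name are hypothetical, and only the forbidden label list comes from the text above.

```python
# Label names the text above says must never appear on Relayna runtime metrics.
FORBIDDEN_LABELS = frozenset({
    "task_id", "correlation_id", "request_id", "worker_id",
    "pod", "pod_name", "container", "message_id",
})


def find_forbidden_labels(label_names):
    """Return the forbidden label names present in `label_names`, sorted."""
    return sorted(FORBIDDEN_LABELS.intersection(label_names))


def assert_low_cardinality(metric_name, label_names):
    """Raise ValueError if a metric declares a per-task or per-pod label."""
    bad = find_forbidden_labels(label_names)
    if bad:
        raise ValueError(f"{metric_name}: high-cardinality labels {bad}")


# Hypothetical metric name, used only for illustration.
assert_low_cardinality("tasks_processed_total", ["service", "status"])
print(find_forbidden_labels(["service", "task_id", "pod"]))
```

A check like this is cheap to run wherever application code defines its own metrics alongside Relayna's.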
Architecture¶
The overall system looks like this when Relayna services, service workers, Studio, Redis, RabbitMQ, Loki, Alloy, Prometheus, Tempo, and kube-state-metrics run inside AKS:
flowchart TB
subgraph aks["AKS cluster"]
subgraph appns["Application namespaces"]
api["Registered service API pods\nFastAPI + Relayna routes\n/status /history /events/feed /executions /metrics"]
workers["Registered service worker pods\nTaskConsumer / workflow workers\nworker metrics helper :8001"]
end
rabbit["RabbitMQ\nTasks, status fanout, retry, DLQ"]
redis["Redis\nStatus, history, DLQ indexes,\nobservations, Studio registry/events"]
subgraph obs["observability namespace"]
alloy["Grafana Alloy DaemonSet\nCRI log tail + Kubernetes labels"]
loki["Loki\nLog storage/query"]
prom["Prometheus\nKubernetes + Relayna runtime metrics"]
tempo["Tempo\nTrace storage/query"]
ksm["kube-state-metrics\nPod/resource/status metrics"]
end
subgraph studio["Studio namespace"]
studioBackend["Studio backend\nRegistry, federation,\nLoki, Prometheus, Tempo proxy, /metrics"]
studioFrontend["Studio frontend\nOperator UI"]
end
end
api -->|publish tasks/status| rabbit
workers -->|consume tasks, publish status/DLQ| rabbit
api -->|status/history/observations| redis
workers -->|status/history/observations| redis
api -->|stdout JSON logs| alloy
workers -->|stdout JSON logs| alloy
alloy -->|push logs| loki
api -->|OTLP spans| alloy
workers -->|OTLP spans| alloy
alloy -->|export spans| tempo
prom -->|scrape /metrics| api
prom -->|scrape worker :8001/metrics| workers
prom -->|scrape /metrics| studioBackend
prom -->|scrape cAdvisor/kubelet| aks
prom -->|scrape| ksm
studioBackend -->|capabilities/status/history/events/graphs| api
studioBackend -->|read/write| redis
studioBackend -->|LogQL query| loki
studioBackend -->|PromQL query_range| prom
studioBackend -->|trace lookup| tempo
studioFrontend -->|Studio API only| studioBackend
Feature layers¶
The same AKS stack supports all four Studio observability phases:
- Centralized logs: pods write JSON logs to stdout, Alloy attaches Kubernetes metadata, and Loki stores the result for Studio log panels.
- Kubernetes metrics: Prometheus scrapes cAdvisor and kube-state-metrics so Studio can show service and task-window infrastructure metrics. Studio uses kube_pod_labels to resolve registered service selector labels to owned pods, then joins platform metrics by namespace and pod.
- Relayna runtime metrics and observations: API/worker /metrics endpoints expose aggregate counters/histograms while Redis observations preserve exact per-task CPU/RSS samples for execution graphs.
- Trace correlation: Relayna propagates W3C traceparent/tracestate through RabbitMQ headers, the application-owned OpenTelemetry SDK exports spans to Alloy/Tempo, and Studio links task detail to trace spans.
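The pod-ownership join in the Kubernetes metrics layer can be pictured as a PromQL expression. The sketch below only builds query strings; the queries Studio actually issues are not shown in this page, so the function names and the exact expression are assumptions. It relies on two known behaviours: kube-state-metrics exposes a pod label such as service as label_service on kube_pod_labels, and cAdvisor publishes container_cpu_usage_seconds_total. In kube-state-metrics v2 the pod labels must also be listed in its --metric-labels-allowlist flag before they appear.

```python
def label_matchers(labels):
    """Render a dict as PromQL label matchers, e.g. a="x",b="y"."""
    return ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))


def owned_pods_selector(namespace, selector_labels):
    """Selector over kube_pod_labels for pods that carry the service labels.

    kube-state-metrics exposes a pod label `service` as `label_service`.
    """
    matchers = {"namespace": namespace}
    matchers.update({f"label_{k}": v for k, v in selector_labels.items()})
    return "kube_pod_labels{" + label_matchers(matchers) + "}"


def cpu_by_owned_pod(namespace, selector_labels, window="5m"):
    """Join cAdvisor CPU usage to the owned pods by (namespace, pod)."""
    cpu = (
        "sum by (namespace, pod) (rate(container_cpu_usage_seconds_total"
        f'{{namespace="{namespace}"}}[{window}]))'
    )
    owned = owned_pods_selector(namespace, selector_labels)
    # kube_pod_labels has value 1, so multiplying keeps only the owned pods.
    return f"{cpu} * on (namespace, pod) group_left() {owned}"


print(cpu_by_owned_pod("default", {"service": "checker-service"}))
```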
Relayna pod conventions¶
Use stable Kubernetes labels on all pods that belong to one logical Relayna service:
metadata:
  labels:
    service: checker-service
    app: checker-service-worker
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8001"
    prometheus.io/path: "/metrics"
Recommended label meaning:
- service: logical Relayna service registered in Studio.
- app: concrete emitter inside that service, such as checker-service-api, checker-service-worker, or checker-service-workflow.
For API pods that expose Relayna metrics through FastAPI:
from fastapi import FastAPI
from relayna.api import create_metrics_router, create_relayna_lifespan, get_relayna_runtime
app = FastAPI(lifespan=create_relayna_lifespan(topology=topology, redis_url=redis_url))
runtime = get_relayna_runtime(app)
app.include_router(create_metrics_router(runtime.metrics))
For worker-only pods:
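import asyncio

from relayna.api import start_metrics_http_server


async def main() -> None:
    # build_worker_runtime() stands for the application's own runtime factory.
    runtime = build_worker_runtime()
    start_metrics_http_server(runtime.metrics, port=8001)
    await runtime.run_forever()


asyncio.run(main())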
If your runtime metrics use the SDK default service label value relayna, set
metrics_config.runtime_service_label_value to relayna in Studio. If you construct
RelaynaMetrics(service="checker-service"), set it to that value instead, so it
can match the Studio service_id.
Loki and Alloy log setup¶
Alloy should collect container stdout, parse Kubernetes metadata, keep low-cardinality labels, and push to Loki. Relayna task identifiers should remain inside JSON log bodies unless you intentionally accept the Loki cardinality cost.
Minimal Alloy River example:
logging {
level = "info"
format = "logfmt"
}
discovery.kubernetes "pods" {
role = "pod"
}
discovery.relabel "pod_logs" {
targets = discovery.kubernetes.pods.targets
rule {
source_labels = ["__meta_kubernetes_namespace"]
target_label = "namespace"
}
rule {
source_labels = ["__meta_kubernetes_pod_label_service"]
target_label = "service"
}
rule {
source_labels = ["__meta_kubernetes_pod_label_app"]
target_label = "app"
}
rule {
source_labels = ["__meta_kubernetes_pod_container_name"]
target_label = "container"
}
rule {
source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
separator = "/"
target_label = "__path__"
replacement = "/var/log/pods/*$1/*.log"
}
}
loki.source.kubernetes "pods" {
targets = discovery.relabel.pod_logs.output
forward_to = [loki.process.relayna.receiver]
}
loki.process "relayna" {
stage.cri {}
stage.json {
expressions = {
level = "level",
task_id = "task_id",
correlation_id = "correlation_id",
}
}
stage.labels {
values = {
level = "level",
}
}
forward_to = [loki.write.default.receiver]
}
loki.write "default" {
endpoint {
url = "http://loki.observability.svc.cluster.local:3100/loki/api/v1/push"
}
}
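To make the effect of the pipeline concrete, the sketch below mimics in plain Python what the stage.json and stage.labels steps do to one log line. It is an illustration of the labelling outcome only, not how Alloy is implemented; the function name and the sample field values are hypothetical.

```python
import json

PROMOTED_LABELS = ("level",)  # mirrors stage.labels in the Alloy example


def split_labels_and_body(raw_line, stream_labels):
    """Return (labels, body) roughly as the pipeline above would leave them.

    Only `level` is promoted to a label; task identifiers stay in the body.
    """
    labels = dict(stream_labels)
    try:
        fields = json.loads(raw_line)
    except json.JSONDecodeError:
        return labels, raw_line  # non-JSON lines pass through without extra labels
    if isinstance(fields, dict):
        for name in PROMOTED_LABELS:
            if name in fields:
                labels[name] = str(fields[name])
    return labels, raw_line


line = json.dumps({
    "event": "order_handler_started",
    "level": "info",
    "task_id": "task-123",
    "correlation_id": "corr-9",
})
labels, body = split_labels_and_body(line, {"namespace": "default", "service": "orders-api"})
print(labels)
```

The task_id and correlation_id values remain searchable through line filters because they are still present in the stored log body.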
Keep these as normal Loki labels:
- namespace
- service
- app
- container
- level
Keep these in the JSON log body by default:
- task_id
- correlation_id
- request_id
- worker_id
- message payload snippets
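With task identifiers kept in the log body, a task lookup becomes a label selector plus a line filter. The sketch below builds such a LogQL query in the spirit of the contains match mode and the {task_id} template used in the registration examples on this page. The exact query Studio sends is not documented here, so treat the helper names and the output shape as assumptions.

```python
def logql_escape(value):
    """Escape backslashes and double quotes for a LogQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_task_log_query(selector_labels, task_id, template="{task_id}"):
    """Build a stream selector plus a `|=` line filter for one task."""
    matchers = ", ".join(
        f'{name}="{logql_escape(value)}"' for name, value in sorted(selector_labels.items())
    )
    needle = logql_escape(template.format(task_id=task_id))
    return f'{{{matchers}}} |= "{needle}"'


print(build_task_log_query({"service": "orders-api"}, "task-123"))
# prints: {service="orders-api"} |= "task-123"
```

The selector stays low-cardinality while the line filter narrows results to one task.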
Prometheus setup¶
Prometheus needs four scrape paths for full Studio metrics:
- cAdvisor/kubelet metrics for CPU, memory, and network counters.
- kube-state-metrics for requests, limits, restarts, OOMKilled, pod phase, and readiness.
- Relayna API and worker /metrics endpoints for aggregate runtime metrics.
- Studio backend /metrics for Studio’s own runtime metrics.
Minimal Prometheus scrape config:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - kube-state-metrics.observability.svc.cluster.local:8080
  - job_name: kubernetes-cadvisor
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
  - job_name: relayna-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_pod_label_service]
        target_label: service
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
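The __address__ rule in the relayna-pods job is the least obvious part of this config: it swaps the discovered container port for the port named in the prometheus.io/port annotation. The sketch below replays that rule in Python for a single target. Prometheus joins source labels with ";" and anchors relabel regexes, so re.fullmatch is used as the closest equivalent; the function name and sample addresses are illustrative.

```python
import re

# Same pattern as the relayna-pods job above.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, annotated_port):
    """Mimic the __address__ relabel rule for one discovered target."""
    joined = f"{address};{annotated_port}"  # source labels joined with ';'
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        return address  # a non-matching 'replace' rule leaves the label as is
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.244.1.7:8000", "8001"))  # prints: 10.244.1.7:8001
print(rewrite_address("10.244.1.7:8000", ""))      # prints: 10.244.1.7:8000
```

The second call shows that a pod without the port annotation keeps its discovered address. The pattern does not handle bare IPv6 addresses.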
Tempo trace setup¶
Relayna core only depends on opentelemetry-api; each registered service that
wants real spans must add and configure its own OpenTelemetry SDK/exporter.
Relayna will propagate W3C trace context through RabbitMQ headers and create
safe spans around publish/consume/status/retry/DLQ paths when a tracer provider
is installed.
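For readers who have not seen the propagated header, the sketch below parses a W3C traceparent value out of a message header mapping. It follows the version 00 layout from the Trace Context specification (version, 32 hex trace id, 16 hex parent id, 2 hex flags). It is not Relayna's implementation, and the function name and the shape of the headers dict are assumptions.

```python
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(headers):
    """Return (trace_id, parent_span_id, sampled) or None if absent/invalid."""
    value = headers.get("traceparent")
    if isinstance(value, bytes):
        value = value.decode("ascii", errors="replace")
    if not isinstance(value, str):
        return None
    match = TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero ids are invalid per the spec
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


headers = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(parse_traceparent(headers))
```

The trace id extracted here is the same value used to look up a trace in Tempo.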
One practical AKS pattern is to send OTLP spans to Alloy and let Alloy export to Tempo:
otelcol.receiver.otlp "relayna" {
grpc {
endpoint = "0.0.0.0:4317"
}
http {
endpoint = "0.0.0.0:4318"
}
output {
traces = [otelcol.processor.batch.relayna.input]
}
}
otelcol.processor.batch "relayna" {
output {
traces = [otelcol.exporter.otlp.tempo.input]
}
}
otelcol.exporter.otlp "tempo" {
client {
endpoint = "tempo.observability.svc.cluster.local:4317"
tls {
insecure = true
}
}
}
For local development, sending spans directly to Tempo also works if Tempo's OTLP ports are exposed:
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_SERVICE_NAME=orders-api
Studio service registration¶
Register each Relayna service with log_config, metrics_config, and
trace_config when all observability phases are enabled:
{
"service_id": "checker-service",
"name": "Checker Service",
"base_url": "http://checker-service-api.default.svc.cluster.local:8000",
"environment": "prod-aks",
"tags": ["checker", "aks"],
"auth_mode": "internal_network",
"log_config": {
"provider": "loki",
"base_url": "http://loki.observability.svc.cluster.local:3100",
"tenant_id": null,
"service_selector_labels": {
"service": "checker-service"
},
"source_label": "app",
"task_match_mode": "contains",
"task_match_template": "{task_id}",
"task_id_label": null,
"correlation_id_label": null,
"level_label": "level"
},
"metrics_config": {
"provider": "prometheus",
"base_url": "http://prometheus.observability.svc.cluster.local:9090",
"namespace": "default",
"service_selector_labels": {
"service": "checker-service"
},
"runtime_service_label_value": "relayna",
"namespace_label": "namespace",
"pod_label": "pod",
"container_label": "container",
"step_seconds": 30,
"task_window_padding_seconds": 120
},
"trace_config": {
"provider": "tempo",
"base_url": "http://tempo.observability.svc.cluster.local:3200",
"public_base_url": null,
"tenant_id": null,
"query_path": "/api/traces/{trace_id}"
}
}
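The step_seconds and task_window_padding_seconds fields are easiest to read as inputs to a Prometheus range query. The sketch below shows one plausible way a padded task window could map onto the standard /api/v1/query_range parameters (query, start, end, step). How Studio actually applies the padding is not specified on this page, so the arithmetic and the function name are assumptions.

```python
from urllib.parse import urlencode


def task_window_query_url(base_url, promql, task_start, task_end,
                          step_seconds=30, padding_seconds=120):
    """Build a /api/v1/query_range URL covering a padded task window.

    task_start and task_end are Unix timestamps in seconds.
    """
    if task_end < task_start:
        raise ValueError("task_end must not precede task_start")
    params = {
        "query": promql,
        "start": task_start - padding_seconds,
        "end": task_end + padding_seconds,
        "step": f"{step_seconds}s",
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


print(task_window_query_url("http://prom:9090/", "up", 1000, 1060))
# prints: http://prom:9090/api/v1/query_range?query=up&start=880&end=1180&step=30s
```

With a 30 second step and 120 seconds of padding, a one-minute task yields a five-minute window with eleven sample points.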
Configure Studio backend egress for AKS DNS:
RELAYNA_STUDIO_CAPABILITY_REFRESH_ALLOWED_HOSTS=.svc.cluster.local
If you use literal private IPs for Loki, Prometheus, Tempo, Redis, or
registered services, also set the matching CIDRs in
RELAYNA_STUDIO_CAPABILITY_REFRESH_ALLOWED_NETWORKS.
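The two settings can be thought of as a DNS suffix allowlist and a CIDR allowlist. The sketch below illustrates that idea with the standard ipaddress module. The matching rules Studio really applies are not described here, so the semantics, the function name, and the sample values are all assumptions made for illustration.

```python
import ipaddress


def is_egress_allowed(host, allowed_host_suffixes, allowed_networks):
    """Allow a host if it is an IP inside an allowed CIDR or a name that
    ends with an allowed DNS suffix. Illustrative semantics only."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None
    if address is not None:
        return any(address in ipaddress.ip_network(cidr) for cidr in allowed_networks)
    name = host.lower().rstrip(".")
    return any(name.endswith(suffix.lower()) for suffix in allowed_host_suffixes)


suffixes = [".svc.cluster.local"]
networks = ["10.0.0.0/16"]
print(is_egress_allowed("loki.observability.svc.cluster.local", suffixes, networks))
print(is_egress_allowed("10.0.3.7", suffixes, networks))
print(is_egress_allowed("example.com", suffixes, networks))
```

Keeping the leading dot in the suffix means a name such as evil-svc.cluster.local does not match.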
Complete registered service example¶
The example below shows a single orders-api registered service with API pods,
worker pods, JSON logs for Loki, /metrics for Prometheus, Redis observations
for exact task resource samples, and OpenTelemetry spans for Tempo.
Install application-owned tracing dependencies in the service image:
[project]
dependencies = [
"relayna>=1.4.12",
"opentelemetry-sdk>=1.28.0",
"opentelemetry-exporter-otlp-proto-grpc>=1.28.0",
"structlog>=24.0.0",
]
Configure structured logs and tracing once at process startup:
import logging

import structlog
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from relayna.observability import bind_studio_log_context


def configure_observability(*, service: str, app: str, env: str) -> structlog.BoundLogger:
    logging.basicConfig(format="%(message)s", level=logging.INFO)
    structlog.configure(
        processors=[
            structlog.processors.TimeStamper(fmt="iso", key="timestamp"),
            structlog.processors.add_log_level,
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
    )
    provider = TracerProvider(
        resource=Resource.create(
            {
                "service.name": app,
                "relayna.service_id": service,
                "deployment.environment": env,
            }
        )
    )
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(endpoint="http://alloy.observability.svc.cluster.local:4317", insecure=True)
        )
    )
    trace.set_tracer_provider(provider)
    return bind_studio_log_context(
        structlog.get_logger(),
        service=service,
        app=app,
        env=env,
        runtime=app,
    )
Expose Relayna federation routes, metrics, status history, events, execution graphs, and observations from the API process:
from fastapi import FastAPI
from relayna.api import (
create_capabilities_router,
create_events_router,
create_execution_router,
create_metrics_router,
create_relayna_lifespan,
create_status_router,
get_relayna_runtime,
)
app = FastAPI(
lifespan=create_relayna_lifespan(
topology=topology,
redis_url="redis://redis.default.svc.cluster.local:6379/0",
observation_store_prefix="orders:observations",
service_event_store_prefix="orders:events",
metrics_service_name="orders-api",
)
)
runtime = get_relayna_runtime(app)
logger = configure_observability(service="orders-api", app="orders-api", env="prod-aks")
app.include_router(create_capabilities_router(topology=topology))
app.include_router(create_metrics_router(runtime.metrics))
app.include_router(
create_status_router(
sse_stream=runtime.sse_stream,
history_reader=runtime.history_reader,
latest_status_store=runtime.store,
)
)
app.include_router(create_events_router(service_event_store=runtime.service_event_store))
app.include_router(create_execution_router(execution_graph_service=runtime.execution_graph_service))
Wire the worker to the same Redis observation store, expose worker metrics, and emit task-aware JSON logs. Relayna handles trace propagation across RabbitMQ headers; the worker only needs the OpenTelemetry SDK configured at startup.
import asyncio

from redis.asyncio import Redis
from relayna.api import RelaynaMetrics, start_metrics_http_server
from relayna.consumer import RetryPolicy, TaskConsumer
from relayna.observability import RedisObservationStore, make_redis_observation_sink

redis = Redis.from_url("redis://redis.default.svc.cluster.local:6379/0")
observation_store = RedisObservationStore(redis, prefix="orders:observations")
logger = configure_observability(service="orders-api", app="orders-worker", env="prod-aks")
metrics = RelaynaMetrics(service="orders-api")


async def handle_order(message: dict) -> None:
    logger.info(
        "order_handler_started",
        task_id=message["task_id"],
        correlation_id=message.get("correlation_id"),
        stage="worker",
    )
    ...


async def main() -> None:
    # rabbitmq_client is the application's RabbitMQ client, created elsewhere.
    consumer = TaskConsumer(
        rabbitmq=rabbitmq_client,
        handler=handle_order,
        retry_policy=RetryPolicy(max_retries=3, delay_ms=30_000),
        observation_sink=make_redis_observation_sink(observation_store),
        metrics=metrics,
    )
    start_metrics_http_server(metrics, port=8001)
    await consumer.run_forever()


asyncio.run(main())
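Use low-cardinality pod labels and Prometheus scrape annotations on both API and worker workloads. The worker manifest is shown here; API pods keep the same service label, set app to their own emitter name (orders-api in this example), and point prometheus.io/port at the port the FastAPI app listens on: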
metadata:
  labels:
    service: orders-api
    app: orders-worker
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8001"
    prometheus.io/path: "/metrics"
Register the service in Studio with all three provider configs:
curl -X POST http://studio-backend.studio.svc.cluster.local:8000/studio/services \
-H 'Content-Type: application/json' \
-d '{
"service_id": "orders-api",
"name": "Orders API",
"base_url": "http://orders-api.default.svc.cluster.local:8000",
"environment": "prod-aks",
"tags": ["orders", "aks"],
"auth_mode": "internal_network",
"log_config": {
"provider": "loki",
"base_url": "http://loki.observability.svc.cluster.local:3100",
"service_selector_labels": {"service": "orders-api"},
"source_label": "app",
"task_match_mode": "contains",
"task_match_template": "{task_id}",
"correlation_id_label": null,
"level_label": "level"
},
"metrics_config": {
"provider": "prometheus",
"base_url": "http://prometheus.observability.svc.cluster.local:9090",
"namespace": "default",
"service_selector_labels": {"service": "orders-api"},
"runtime_service_label_value": "orders-api",
"namespace_label": "namespace",
"pod_label": "pod",
"container_label": "container",
"step_seconds": 30,
"task_window_padding_seconds": 120
},
"trace_config": {
"provider": "tempo",
"base_url": "http://tempo.observability.svc.cluster.local:3200",
"public_base_url": null,
"tenant_id": null,
"query_path": "/api/traces/{trace_id}"
}
}'
After a task runs, Studio should be able to show:
- service logs and task logs from Loki
- service and task-window metrics from Prometheus
- Relayna runtime charts and exact task CPU/RSS samples
- execution graph and task timeline from Redis-backed Relayna status and observation data
- trace IDs discovered from task detail/log fields and Tempo spans in the task detail Trace Correlation section
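When checking the last item by hand, it helps to know which Tempo URL corresponds to a given trace id. The sketch below fills the query_path template from trace_config after validating the id. It assumes the template is a plain {trace_id} placeholder as in the registration payload; the function name is hypothetical and this is not Studio's code.

```python
import re

TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def tempo_trace_url(base_url, trace_id, query_path="/api/traces/{trace_id}"):
    """Fill the query_path template after validating the trace id."""
    normalized = trace_id.strip().lower()
    if TRACE_ID_RE.fullmatch(normalized) is None:
        raise ValueError(f"not a 32-character hex trace id: {trace_id!r}")
    return base_url.rstrip("/") + query_path.format(trace_id=normalized)


print(tempo_trace_url(
    "http://tempo.observability.svc.cluster.local:3200",
    "4bf92f3577b34da6a3ce929d0e0e4736",
))
```

Requesting that URL from inside the cluster is a quick way to confirm that spans for a task reached Tempo.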
Bootstrap script¶
The repository includes a starter AKS deployment script:
scripts/deploy-relayna-observability-aks.sh
It creates a namespace and installs Loki, Alloy, Prometheus, Tempo, and kube-state-metrics with Relayna-compatible scrape, log-label, and OTLP trace forwarding defaults for all four phases. Review storage classes, resource requests, retention, auth, and network policy before using it in production.
Useful overrides:
NAMESPACE=observability
STORAGE_CLASS=managed-csi
LOKI_STORAGE_SIZE=10Gi
PROMETHEUS_STORAGE_SIZE=20Gi
TEMPO_STORAGE_SIZE=10Gi
LOKI_RETENTION=168h
PROMETHEUS_RETENTION=15d
TEMPO_RETENTION=168h
TEMPO_IMAGE=grafana/tempo:latest