Service health
Temporal Cloud metrics help monitor the Service health of production deployments. This documentation covers best practices for monitoring Service health.
Monitor availability issues
When you see a sudden drop in Worker resource utilization, verify whether Temporal Cloud's API is showing increased latency and error rates.
Reference Metrics
The service latency metric (temporal_cloud_v0_service_latency_bucket in the query below) measures latency for the SignalWithStartWorkflowExecution, SignalWorkflowExecution, and StartWorkflowExecution operations.
These operations are mission-critical and are never throttled.
This metric is a good indicator of your lowest possible latency.
Prometheus Query for this Metric
P99 service latency (histogram):
histogram_quantile(0.99, sum(rate(temporal_cloud_v0_service_latency_bucket[$__rate_interval])) by (temporal_namespace, operation, le))
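To make the P99 query easier to interpret, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach behind histogram_quantile. It is a simplified illustration, not the Prometheus implementation. The function name estimate_quantile and the bucket values in the usage note are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, ending with a
    (math.inf, total_count) entry, mirroring the `le` label of a histogram.
    Simplified sketch: assumes non-negative bounds and 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile

    rank = q * total  # number of observations at or below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

For example, with the made-up buckets [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)], estimate_quantile(0.99, ...) lands at the top of the 1.0 bucket. The estimate can only be as precise as the bucket boundaries allow, which is worth keeping in mind when reading P99 values from the query above.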
Monitor Temporal Service errors
Check for Temporal Service gRPC API errors. Note that Service API errors are not equivalent to the guarantees in the Temporal Cloud SLA.
Reference Metrics
Prometheus Query for this Metric
Measure your daily average success rate (the fraction of requests that did not return an error), evaluated over 10-minute windows:
avg_over_time((
(
(
sum(increase(temporal_cloud_v0_frontend_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
-
sum(increase(temporal_cloud_v0_frontend_service_error_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
)
/
sum(increase(temporal_cloud_v0_frontend_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
)
or vector(1)
)[1d:10m])
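The arithmetic in the query above can be sketched in Python as follows. For each 10-minute window it computes (requests - errors) / requests, falls back to 1 when the window has no data (the role of `or vector(1)`), and then averages the windows. The function names and sample counts are hypothetical. Treating a window with zero requests as fully successful is a simplification of this sketch and may not match how the query behaves in that case.

```python
def window_success_rate(requests, errors):
    """Success rate for one window; None means no samples were reported."""
    if requests is None or requests == 0:
        # No traffic in this window: count it as fully successful,
        # mirroring the `or vector(1)` fallback (assumption for zero requests).
        return 1.0
    return (requests - (errors or 0)) / requests


def average_success_rate(windows):
    """Average the per-window success rates, like avg_over_time over a subquery.

    windows: list of (request_count, error_count) pairs, one per 10-minute step.
    """
    if not windows:
        raise ValueError("at least one window is required")
    rates = [window_success_rate(requests, errors) for requests, errors in windows]
    return sum(rates) / len(rates)
```

For instance, average_success_rate([(100, 1), (200, 0), (None, None)]) averages 0.99, 1.0, and 1.0. Because each window has equal weight regardless of request volume, a short burst of errors in a quiet window moves the daily figure as much as the same error ratio in a busy one.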