Version: 1.2.x

Key Metrics

Tetrate Service Bridge collects a large number of metrics. This page is generated from dashboards run internally at Tetrate and will be updated periodically based on best practices learned from operational experience at Tetrate and from user deployments. Each heading represents a different dashboard, and each sub-heading is a panel on that dashboard. For this reason, you may see metrics appear multiple times.

Global Configuration Distribution

These metrics indicate the overall health of Tetrate Service Bridge and should be considered the starting point for any investigation into issues with Tetrate Service Bridge.

Connected Clusters

This details all clusters connected to and receiving configuration from the management plane.

If this number drops below 1, or a given cluster does not appear in this table, that cluster is disconnected. This may happen for a brief period of time during upgrades/re-deploys.

Metric Name | Labels | PromQL Expression
grpc_client_msg_received_total | component, grpc_type | count(sum(rate(grpc_client_msg_received_total{component="tsbd", grpc_type="server_stream"}[30s])) by (cluster_name)) by (cluster_name)

TSB Error Rate (Humans)

Rate of failed requests to the TSB apiserver from the UI and CLI.

Metric Name | Labels | PromQL Expression
grpc_server_handled_total | component, grpc_code, grpc_method, grpc_type | sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_code) OR on() vector(0)

Istio-Envoy Sync Time (99th Percentile)

Once XCP has synced with the management plane it creates resources for Istio to configure Envoy. Istio usually distributes these within a second.

If this number starts to exceed 10 seconds, you may need to scale out istiod. In small clusters, the convergence time may be too small for the histogram buckets to resolve, so the panel may show nil.

Metric Name | Labels | PromQL Expression
pilot_proxy_convergence_time_bucket | N/A | histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le, cluster_name))
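
To make the percentile panels easier to interpret, the following is a minimal Python sketch of how a quantile is estimated from cumulative histogram buckets, as histogram_quantile does for classic histograms. It is a simplified model, not the Prometheus implementation: the function name and the bucket bounds in the example are assumptions chosen for illustration, and edge cases such as negative bounds are ignored.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, and its last entry must be the (math.inf, total) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1] for this sketch")
    if not buckets or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        # No observations in the window: there is nothing to estimate.
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    upper, count = buckets[idx]
    lower, prev = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Hypothetical bucket bounds (seconds) and counts, for illustration only.
fast_cluster = [(0.1, 50), (0.5, 50), (math.inf, 50)]
idle_cluster = [(0.1, 0), (0.5, 0), (math.inf, 0)]
print(histogram_quantile(0.5, fast_cluster))   # interpolated inside [0, 0.1]
print(histogram_quantile(0.99, idle_cluster))  # nan: no samples in the window
```

In this model, every observation below the smallest bucket bound is interpolated within that first bucket, and a window with no observations yields NaN, which is one way a small or quiet cluster can end up with an empty panel.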

Istiod Errors

Rate of istiod errors broken down by cluster. This graph helps identify clusters that may be experiencing problems. Typically, there should be no errors. Any non-transient errors should be investigated.

Sometimes this graph will show "No data" or these metrics won't exist. This is because istiod only emits these metrics if the errors occur.

Metric Name | Labels | PromQL Expression
pilot_total_xds_internal_errors | N/A | sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_total_xds_rejects | N/A | sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_xds_expired_nonce | N/A | sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_xds_push_context_errors | N/A | sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_xds_pushes | type | sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_xds_write_timeout | N/A | sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
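
Because istiod only emits these metrics once an error has occurred, it can help to see how a chain of additions behaves when one operand has no series. The Python sketch below is a simplified model of PromQL vector matching, not Prometheus code: add_vectors and or_on_vector0 are hypothetical helper names, each dict stands for the result of one sum(rate(...)) by (cluster_name) term, and the sample rates are made up.

```python
def add_vectors(*vectors):
    """Model PromQL '+' between instant vectors keyed by cluster_name.

    Only clusters that are present in every operand survive the addition.
    """
    common = set(vectors[0])
    for vector in vectors[1:]:
        common &= set(vector)
    return {cluster: sum(v[cluster] for v in vectors) for cluster in common}


def or_on_vector0(vector):
    """Model 'OR on() vector(0)': fall back to one unlabelled 0 sample.

    The fallback only applies when the left-hand side is empty.
    """
    return vector if vector else {"": 0.0}


# Made-up per-cluster error rates (errors per second).
write_timeouts = {"cluster-a": 0.2, "cluster-b": 0.1}
rejects = {"cluster-a": 0.3}
never_emitted = {}  # a metric that istiod has not emitted yet

print(or_on_vector0(add_vectors(write_timeouts, rejects)))
print(or_on_vector0(add_vectors(write_timeouts, rejects, never_emitted)))
```

Under this model, a cluster appears in the sum only if every term has a series for it, and the expression falls back to 0 as soon as any single term is empty. When the summary panel reads 0 but a problem is suspected, checking each error metric individually avoids relying on the combined value.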

Istio Operational Status

Operational metrics for istiod health.

Connected Envoys

Count of Envoys connected to istiod. This should represent the total number of endpoints in the selected cluster.

If this number significantly decreases for longer than 5 minutes without an obvious reason (e.g. a scale-down event) then you should investigate. This may indicate that Envoys have been disconnected from istiod and are unable to reconnect.

Metric Name | Labels | PromQL Expression
pilot_xds | cluster_name | sum(pilot_xds{cluster_name="$cluster"})

Total Error Rate

The total error rate for Istio when configuring Envoy, including generation and transport errors.

Any errors (current and historic) should be investigated using the more detailed split below.

Metric Name | Labels | PromQL Expression
pilot_total_xds_internal_errors | cluster_name | sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_total_xds_rejects | cluster_name | sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_xds_expired_nonce | cluster_name | sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_xds_push_context_errors | cluster_name | sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_xds_pushes | cluster_name, type | sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_xds_write_timeout | cluster_name | sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)

Median Proxy Convergence Time

The median (50th percentile) delay between istiod receiving configuration changes and the proxy receiving all required configuration in the selected cluster. This number indicates how stale the proxy configuration is. As this number increases, it may start to impact application traffic.

This number is typically in the hundreds of milliseconds. In small clusters, this number may be zero.

If this number creeps up to 30s for an extended period, istiod likely needs to be scaled out (or up).

Metric Name | Labels | PromQL Expression
pilot_proxy_convergence_time_bucket | cluster_name | histogram_quantile(0.5, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))

Istiod Push Rate

The rate of istiod pushes to Envoy grouped by discovery service. Istiod pushes clusters (CDS), endpoints (EDS), listeners (LDS) or routes (RDS) any time it receives a configuration change.

Changes are triggered by a user interacting with TSB or a change in infrastructure such as a new endpoint (service instance/pod) creation.

In small, relatively static clusters these values can be zero most of the time.

Metric Name | Labels | PromQL Expression
pilot_xds_pushes | cluster_name, type | sum(irate(pilot_xds_pushes{cluster_name="$cluster", type=~"cds|eds|rds|lds"}[1m])) by (type)
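
This panel uses irate rather than rate, which looks only at the last two samples in the window. The Python sketch below models that calculation for a single counter series; it is a simplified illustration rather than the Prometheus implementation, and the function name, scrape interval, and sample values are assumptions.

```python
def irate(samples):
    """Per-second instant rate from the last two counter samples.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two usable samples are available.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # A counter that went down is treated as a reset to zero.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Made-up samples scraped every 15 seconds.
print(irate([(0, 100), (15, 100), (30, 130)]))  # 30 pushes over the last 15s
print(irate([(0, 100), (15, 130), (30, 130)]))  # no pushes between the last two scrapes
```

Because only the final pair of samples matters, the graph reacts quickly to a burst of pushes and drops back to zero as soon as two consecutive scrapes see no change, which matches the mostly-zero behaviour described for static clusters.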

Istiod Error Rate

The different error rates for Istio during general operations, including the generation and distribution of Envoy configuration.

pilot_xds_write_timeout: Rate of connection timeouts between Envoy and istiod. This number indicates that an Envoy has taken too long to acknowledge a configuration change from Istio. An increase in these errors typically indicates network issues, Envoy resource limits or istiod resource limits (usually CPU).

pilot_total_xds_internal_errors: Rate of errors thrown inside istiod whilst generating Envoy configuration. Check the istiod logs for more details if you see internal errors.

pilot_total_xds_rejects: Rate of rejected configuration from Envoy. Istio should never produce any invalid Envoy configuration, so any errors here warrant investigation, starting with the istiod logs.

pilot_xds_expired_nonce: Rate of expired nonces from Envoys. This number indicates that an Envoy has responded to the wrong request sent from Istio. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU).

pilot_xds_push_context_errors: Rate of errors setting a connection with an Envoy instance. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU). Check istiod logs for further details.

pilot_xds_pushes: Rate of transport errors sending configuration to Envoy. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU).

Metric Name | Labels | PromQL Expression
pilot_total_xds_internal_errors | cluster_name | sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m]))
pilot_total_xds_rejects | cluster_name | sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m]))
pilot_xds_expired_nonce | cluster_name | sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m]))
pilot_xds_push_context_errors | cluster_name | sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m]))
pilot_xds_pushes | cluster_name, type | sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) by (type)
pilot_xds_write_timeout | cluster_name | sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m]))

Proxy Convergence Time

The delay between istiod receiving configuration changes and a proxy receiving all required configuration in the cluster, broken down by percentile.

This number indicates how stale the proxy configuration is. As this number increases it may start to affect application traffic.

This number is typically in the hundreds of milliseconds. If this number creeps up to 30s for an extended period of time, istiod probably needs to be scaled out (or up), as it is likely pinned against its CPU limits.

Metric Name | Labels | PromQL Expression
pilot_proxy_convergence_time_bucket | cluster_name | histogram_quantile(0.5, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
pilot_proxy_convergence_time_bucket | cluster_name | histogram_quantile(0.90, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
pilot_proxy_convergence_time_bucket | cluster_name | histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
pilot_proxy_convergence_time_bucket | cluster_name | histogram_quantile(0.999, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))

Configuration Validation

Success and failure rate of Istio configuration validation requests. This is triggered when TSB configuration is created or updated.

Any failures here should be investigated in the istiod and tsbd logs.

If there are TSB configuration changes being made that affect the selected cluster and the success number is zero then there is an issue with configuration propagation. Check the tsbd logs to debug further.

Metric Name | Labels | PromQL Expression
galley_validation_failed | cluster_name | sum(rate(galley_validation_failed{cluster_name="$cluster"}[1m]))
galley_validation_passed | cluster_name | sum(rate(galley_validation_passed{cluster_name="$cluster"}[1m]))

Sidecar Injection

Rate of sidecar injection requests. Sidecar injection is triggered whenever a new instance/pod is created.

Any errors displayed here should be investigated further by checking the istiod logs.

Metric Name | Labels | PromQL Expression
sidecar_injection_failure_total | cluster_name | sum(rate(sidecar_injection_failure_total{cluster_name="$cluster"}[1m]))
sidecar_injection_success_total | cluster_name | sum(rate(sidecar_injection_success_total{cluster_name="$cluster"}[1m]))

MPC Operational Status

Operational metrics to indicate Management Plane Controller (MPC) health.

Config Update Messages

Config update messages sent over the gRPC stream from TSB and received by MPC.

This metric can help understand how messages are queued in MPC when it is under load. The value for both metrics should always be the same. If the Received by MPC metric has a value lower than the TSB one, it means MPC is under load and cannot process all messages sent by TSB as fast as TSB is sending them.

Metric Name | Labels | PromQL Expression
grpc_client_msg_received_total | component, grpc_method | sum(increase(grpc_client_msg_received_total{component="mpc", grpc_method="GetAllConfigObjects"}[5m])) or on() vector(0)
grpc_server_msg_sent_total | component, grpc_method | sum(increase(grpc_server_msg_sent_total{component="tsb", grpc_method="GetAllConfigObjects"}[5m])) or on() vector(0)
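
The comparison between the two series can be expressed as a small check. The Python sketch below assumes the two panel values have already been fetched as numbers; the function names and the tolerance parameter are hypothetical and not part of TSB. A tolerance is included because increase() extrapolates over the window and can return fractional values, so the two series may differ slightly even when MPC is keeping up.

```python
def mpc_backlog(sent_by_tsb: float, received_by_mpc: float) -> float:
    """Messages sent by TSB in the window that MPC has not yet received."""
    return max(sent_by_tsb - received_by_mpc, 0.0)


def mpc_is_lagging(sent_by_tsb: float, received_by_mpc: float,
                   tolerance: float = 1.0) -> bool:
    """True when the backlog exceeds the (assumed) tolerance."""
    return mpc_backlog(sent_by_tsb, received_by_mpc) > tolerance


# Made-up 5-minute increases for the two series.
print(mpc_is_lagging(sent_by_tsb=120.0, received_by_mpc=119.6))  # within tolerance
print(mpc_is_lagging(sent_by_tsb=120.0, received_by_mpc=80.0))   # MPC is behind
```

The same check applies unchanged to any other panel that pairs a sent series with a received series over the same window.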

Config updates processed every 5m

The number of configuration updates received by the Management Plane Controller (MPC) that must be processed and sent to XCP.

TSB sends the config updates over a permanently connected gRPC stream to MPC, and this metric shows the number of messages received and processed by MPC on that stream.

Metric Name | Labels | PromQL Expression
permanent_stream_operation | error, name | sum(increase(permanent_stream_operation{name="ConfigUpdates", error=""}[5m])) or on() vector(0)
permanent_stream_operation | error, name | sum(increase(permanent_stream_operation{name="ConfigUpdates", error!=""}[5m])) or on() vector(0)

Config stream connection attempts every 5m

The number of connection (and reconnection) attempts on the config updates stream.

TSB sends the config updates over a permanently connected gRPC stream to MPC. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name | Labels | PromQL Expression
permanent_stream_connection_attempts | error, name | sum(increase(permanent_stream_connection_attempts{name="ConfigUpdates", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts | error, name | sum(increase(permanent_stream_connection_attempts{name="ConfigUpdates", error!=""}[5m])) or on() vector(0)

XCP Config Push Duration

Time it took for configuration objects to be pushed to XCP.

This metric shows the time it takes for MPC to apply all the configuration objects in the XCP namespace once all the configuration objects have been received from TSB and translated into XCP objects.

Metric Name | Labels | PromQL Expression
mpc_xcp_config_push_time | error | mpc_xcp_config_push_time{error=""} or on() vector(0)
mpc_xcp_config_push_time | error | mpc_xcp_config_push_time{error!=""} or on() vector(0)

TSB to MPC sent configs

The number of resources that are sent from TSB to MPC.

This metric shows the number of objects that are created, updated, and deleted as part of a configuration push from MPC to XCP.

This metric can be used together with the XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name | Labels | PromQL Expression
mpc_tsb_config_received_count | N/A | mpc_tsb_config_received_count

XCP Resource conversion rate

Configuration updates received from TSB are processed by MPC and translated into XCP resources. This metric shows the conversion rate of TSB resources to XCP resources. It gives a good idea of the number of resources of each type in the runtime configuration.

Metric Name | Labels | PromQL Expression
mpc_xcp_conversion_count | N/A | sum(rate(mpc_xcp_conversion_count[1m])) by (resource)

MPC to XCP pushed configs

The number of resources that are pushed to XCP.

This metric shows the number of objects that are created, updated, and deleted as part of a configuration push from MPC to XCP. It also shows how many fetch calls to the k8s api server are done.

This metric can be used together with the TSB to MPC sent configs and XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name | Labels | PromQL Expression
mpc_xcp_config_create_ops | N/A | sum(mpc_xcp_config_create_ops)
mpc_xcp_config_delete_ops | N/A | sum(mpc_xcp_config_delete_ops)
mpc_xcp_config_fetch_ops | N/A | sum(mpc_xcp_config_fetch_ops)
mpc_xcp_config_update_ops | N/A | sum(mpc_xcp_config_update_ops)

XCP Resource conversion error rate

Configuration updates received from TSB are processed by MPC and translated into XCP resources. This metric shows the conversion error rate of TSB resources to XCP resources. It should always be zero. If there are errors reported in this graph, there are incompatibilities between the XCP resources and the TSB ones. This may be the result of mismatching version compatibility between TSB and XCP.

Metric Name | Labels | PromQL Expression
mpc_xcp_conversion_count | error | sum(rate(mpc_xcp_conversion_count{error!=""}[1m])) by (resource) or on() vector(0)

MPC to XCP pushed configs error

The number of resources that failed while pushing to XCP.

This metric shows the number of objects that fail to be created, updated, or deleted as part of a configuration push from MPC to XCP. It also shows the number of failed fetch calls to the k8s api server.

This metric can be used together with the MPC to TSB push configs and the XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name | Labels | PromQL Expression
mpc_xcp_config_create_ops_err | N/A | sum(mpc_xcp_config_create_ops_err)
mpc_xcp_config_delete_ops_err | N/A | sum(mpc_xcp_config_delete_ops_err)
mpc_xcp_config_fetch_ops_err | N/A | sum(mpc_xcp_config_fetch_ops_err)
mpc_xcp_config_update_ops_err | N/A | sum(mpc_xcp_config_update_ops_err)

Cluster Update Messages

Cluster update messages sent over the gRPC stream from TSB and received by MPC.

This metric can help understand how messages are queued in MPC when it is under load. The value for both metrics should always be the same. If the Received by MPC metric has a value lower than the TSB one, it means MPC is under load and cannot process all messages sent by TSB as fast as TSB is sending them.

Metric Name | Labels | PromQL Expression
grpc_client_msg_received_total | component, grpc_method | sum(increase(grpc_client_msg_received_total{component="mpc", grpc_method="GetAllClusters"}[5m])) or on() vector(0)
grpc_server_msg_sent_total | component, grpc_method | sum(increase(grpc_server_msg_sent_total{component="tsb", grpc_method="GetAllClusters"}[5m])) or on() vector(0)

TSB Cluster updates processed every 5m

The number of cluster updates received by the Management Plane Controller (MPC) that must be processed and sent to XCP.

TSB sends the cluster updates (e.g. new onboarded clusters, deleted clusters) over a permanently connected gRPC stream to MPC. This metric shows the number of messages received and processed by MPC on that stream.

Metric Name | Labels | PromQL Expression
permanent_stream_operation | error, name | sum(increase(permanent_stream_operation{name="ClusterPush", error=""}[5m])) or on() vector(0)
permanent_stream_operation | error, name | sum(increase(permanent_stream_operation{name="ClusterPush", error!=""}[5m])) or on() vector(0)

TSB Cluster stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster updates stream. TSB sends the cluster updates over a permanently connected gRPC stream to MPC. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name | Labels | PromQL Expression
permanent_stream_connection_attempts | error, name | sum(increase(permanent_stream_connection_attempts{name="ClusterPush", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts | error, name | sum(increase(permanent_stream_connection_attempts{name="ClusterPush", error!=""}[5m])) or on() vector(0)

Cluster Status Update from XCP

Cluster status update messages received from XCP over a gRPC stream.

Metric Name | Labels | PromQL Expression
grpc_client_msg_received_total | component, grpc_method | sum(increase(grpc_client_msg_received_total{component="mpc", grpc_method="GetClusterState"}[5m])) or on() vector(0)

Cluster updates from XCP processed every 5m

The number of cluster status updates received by the Management Plane Controller (MPC) from XCP that must be processed and sent to TSB.

XCP sends the cluster status updates (e.g. services deployed in the cluster) over a permanently connected gRPC stream to MPC. This metric shows the number of messages received and processed by MPC on that stream.

Metric Name | Labels | PromQL Expression
permanent_stream_operation | error, name | sum(increase(permanent_stream_operation{name="ClusterStateFromXCP", error=""}[5m])) or on() vector(0)
permanent_stream_operation | error, name | sum(increase(permanent_stream_operation{name="ClusterStateFromXCP", error!=""}[5m])) or on() vector(0)

Cluster updates from XCP stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster status updates from XCP stream. XCP sends the cluster status updates over a permanently connected gRPC stream to MPC. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name | Labels | PromQL Expression
permanent_stream_connection_attempts | error, name | sum(increase(permanent_stream_connection_attempts{name="ClusterStateFromXCP", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts | error, name | sum(increase(permanent_stream_connection_attempts{name="ClusterStateFromXCP", error!=""}[5m])) or on() vector(0)

XCP cluster status updates processed every 5m

This is the number of cluster status updates that are processed by the Management Plane Controller (MPC) to be sent to TSB.

MPC sends the cluster status updates over a gRPC stream that is permanently connected to TSB, and this metric shows the number of cluster updates that are processed by MPC and sent to TSB on that stream.

Metric Name | Labels | PromQL Expression
permanent_stream_operation | error, name | sum(increase(permanent_stream_operation{name="ClusterUpdates", error=""}[5m])) or on() vector(0)
permanent_stream_operation | error, name | sum(increase(permanent_stream_operation{name="ClusterUpdates", error!=""}[5m])) or on() vector(0)

Cluster status updates to TSB stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster status updates stream. MPC sends the cluster status updates over a permanently connected gRPC stream to TSB. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name | Labels | PromQL Expression
permanent_stream_connection_attempts | error, name | sum(increase(permanent_stream_connection_attempts{name="ClusterUpdates", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts | error, name | sum(increase(permanent_stream_connection_attempts{name="ClusterUpdates", error!=""}[5m])) or on() vector(0)

OAP Operational Status

Operational metrics to indicate Tetrate Service Bridge OAP stack health.

OAP Request Rate

The request rate to OAP, by status.

Metric Name | Labels | PromQL Expression
envoy_cluster_upstream_rq_xx | envoy_cluster_name, plane | sum by (envoy_response_code_class) (rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name="oap-grpc", plane="management"}[1m]))

OAP Request Latency

The OAP request latency, by percentile.

Metric Name | Labels | PromQL Expression
envoy_cluster_upstream_rq_time_bucket | envoy_cluster_name, plane | histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
envoy_cluster_upstream_rq_time_bucket | envoy_cluster_name, plane | histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
envoy_cluster_upstream_rq_time_bucket | envoy_cluster_name, plane | histogram_quantile(0.90, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
envoy_cluster_upstream_rq_time_bucket | envoy_cluster_name, plane | histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
envoy_cluster_upstream_rq_time_bucket | envoy_cluster_name, plane | histogram_quantile(0.50, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))

OAP Aggregation Request Rate

OAP Aggregation Request Rate, by type:

  • central aggregation service handler received
  • central application aggregation received
  • central service aggregation received

Metric Name | Labels | PromQL Expression
central_aggregation_handler | N/A | sum(rate(central_aggregation_handler[1m]))
central_app_aggregation | N/A | sum(rate(central_app_aggregation[1m]))
central_service_aggregation | N/A | sum(rate(central_service_aggregation[1m]))

OAP Aggregation Rows

Cumulative rate of rows in OAP aggregation.

Metric Name | Labels | PromQL Expression
metrics_aggregation | plane | sum(rate(metrics_aggregation{plane="management"}[1m]))

OAP Mesh Analysis Latency

The processing latency of the OAP service mesh telemetry streaming process.

Metric Name | Labels | PromQL Expression
mesh_analysis_latency_bucket | component, plane | histogram_quantile(0.99, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
mesh_analysis_latency_bucket | component, plane | histogram_quantile(0.95, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
mesh_analysis_latency_bucket | component, plane | histogram_quantile(0.90, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
mesh_analysis_latency_bucket | component, plane | histogram_quantile(0.75, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))

JVM Threads

Number of threads in the OAP JVM.

Metric Name | Labels | PromQL Expression
jvm_threads_current | component, plane | sum(jvm_threads_current{component="oap", plane="management"})
jvm_threads_daemon | component, plane | sum(jvm_threads_daemon{component="oap", plane="management"})
jvm_threads_deadlocked | component, plane | sum(jvm_threads_deadlocked{component="oap", plane="management"})
jvm_threads_peak | component, plane | sum(jvm_threads_peak{component="oap", plane="management"})

JVM Memory

JVM Memory stats of OAP JVM instances.

Metric Name | Labels | PromQL Expression
jvm_memory_bytes_max | component, plane | sum by (area, instance) (jvm_memory_bytes_max{component="oap", plane="management"})
jvm_memory_bytes_used | component, plane | sum by (area, instance) (jvm_memory_bytes_used{component="oap", plane="management"})
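
The two series are easiest to read as a ratio of used to maximum memory per (area, instance) pair. The Python sketch below is one way to derive that ratio from the two query results once they are loaded into dicts; the function name, instance names, and byte values are assumptions for illustration. It skips series whose maximum is not positive, on the assumption that the JVM reports -1 for memory areas with no defined maximum.

```python
def memory_utilization(used, maximum):
    """Return used/max per (area, instance) key.

    Keys with a missing or non-positive maximum are skipped, because a
    ratio is meaningless when the JVM reports no defined limit.
    """
    ratios = {}
    for key, used_bytes in used.items():
        max_bytes = maximum.get(key)
        if max_bytes is None or max_bytes <= 0:
            continue
        ratios[key] = used_bytes / max_bytes
    return ratios


# Made-up values keyed by (area, instance).
used = {("heap", "oap-0"): 3 * 2**30, ("nonheap", "oap-0"): 200 * 2**20}
maximum = {("heap", "oap-0"): 4 * 2**30, ("nonheap", "oap-0"): -1}
print(memory_utilization(used, maximum))  # only the heap entry has a ratio
```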

TSB Operational Status

Operational metrics to indicate Tetrate Service Bridge API server health.

AuthZ Success Rate

Rate of successful requests to the AuthZ server. This includes all user and cluster requests into the management plane.

Note: This indicates the health of the AuthZ server, not whether the user or cluster making the request has the correct permissions.

Metric Name | Labels | PromQL Expression
envoy_cluster_internal_upstream_rq | envoy_response_code | sum(rate(envoy_cluster_internal_upstream_rq{envoy_response_code=~"2.*"}[1m])) by (envoy_cluster_name)

AuthZ Error Rate

The error rate of requests to the AuthZ server. This includes all user and cluster requests into the management plane.

Note: This indicates the health of the AuthZ server, not whether the user or cluster making the request has the correct permissions.

Metric Name | Labels | PromQL Expression
envoy_cluster_internal_upstream_rq | envoy_response_code | sum(rate(envoy_cluster_internal_upstream_rq{envoy_response_code!~"2.*"}[1m])) by (envoy_cluster_name)
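
The AuthZ success and error panels split requests with the matchers =~"2.*" and !~"2.*". PromQL regular expressions are anchored at both ends, so the pattern means "starts with 2" rather than "contains a 2". The Python sketch below mirrors that split using re.fullmatch; the function names and the sample rates are hypothetical.

```python
import re

SUCCESS_PATTERN = re.compile(r"2.*")


def is_success(response_code: str) -> bool:
    """Anchored match, equivalent to envoy_response_code=~"2.*" in PromQL."""
    return SUCCESS_PATTERN.fullmatch(response_code) is not None


def split_rates(rates_by_code):
    """Sum request rates into (success, error) totals by response code."""
    success = sum(r for code, r in rates_by_code.items() if is_success(code))
    error = sum(r for code, r in rates_by_code.items() if not is_success(code))
    return success, error


# Made-up request rates (requests per second) by response code.
rates = {"200": 40.0, "204": 2.0, "302": 1.0, "403": 0.5, "503": 0.25}
print(split_rates(rates))  # 302 counts as an error even though it contains a 2
```

An unanchored search would have classified 302 as a success, which is why fullmatch is used here.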

AuthZ Latency

AuthZ request latency percentiles.

Metric Name | Labels | PromQL Expression
envoy_cluster_internal_upstream_rq_time_bucket | N/A | histogram_quantile(0.99, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket[1m])) by (le, envoy_cluster_name))
envoy_cluster_internal_upstream_rq_time_bucket | N/A | histogram_quantile(0.95, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket[1m])) by (le, envoy_cluster_name))

TSB Success Rate

Rate of successful requests to the TSB apiserver from the UI and CLI.

Metric Name | Labels | PromQL Expression
grpc_server_handled_total | component, grpc_code, grpc_method, grpc_type | sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_method)

TSB Error Rate

Rate of failed requests to the TSB apiserver from the UI and CLI.

Metric Name | Labels | PromQL Expression
grpc_server_handled_total | component, grpc_code, grpc_method, grpc_type | sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_code)

Data Store Success Rate

Successful request rate for operations persisting data to the datastore grouped by method and kind.

This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

Metric Name | Labels | PromQL Expression
persistence_operation | error | sum(rate(persistence_operation{error=""}[1m])) by (kind, method)
persistence_transaction | error | sum(rate(persistence_transaction{error=""}[1m]))

Data Store Latency

The request latency for operations persisting data to the datastore grouped by method.

This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

Metric Name | Labels | PromQL Expression
persistence_operation_duration_bucket | N/A | histogram_quantile(0.99, sum(rate(persistence_operation_duration_bucket[1m])) by (le, method))
persistence_transaction_duration_bucket | N/A | histogram_quantile(0.99, sum(rate(persistence_transaction_duration_bucket[1m])) by (le))

Data Store Error Rate

The request error rate for operations persisting data to the datastore grouped by method and kind.

This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

Note: The graph explicitly excludes "resource not found" errors. A small number of "not found" responses is normal because, as an optimization, TSB often uses Get queries instead of Exists to determine whether a resource exists.

Metric Name | Labels | PromQL Expression
persistence_operation | error | sum(rate(persistence_operation{error!="", error!="resource not found"}[1m])) by (kind, method)
persistence_transaction | error | sum(rate(persistence_transaction{error!=""}[1m]))

Active Transactions

The number of running transactions on the datastore.

This graph shows how many active transactions are running at a given point in time. It helps you understand the load of the system generated by concurrent access to the platform.

Metric Name | Labels | PromQL Expression
persistence_concurrent_transaction | N/A | sum(persistence_concurrent_transaction)

Dual-Write Operations Request Rate

The request rate for operations persisting data to the Q Graph or Persistent Data Store via the dual-write framework (used for zero-downtime data model migrations).

This graph shows the total request rate grouped by write stage (primary/secondary), as well as the error rate grouped by stage and error code.

  • Primary writes are always executed synchronously, and any failure in a primary write will also manifest as an API error.
  • Secondary writes are done in the background and do not manifest as direct API errors. Failures are allowed here, and the data reconcile process will fix any inconsistencies between the primary and secondary models.

Metric Name | Labels | PromQL Expression
dualop_operation | stage | sum(rate(dualop_operation{stage!=""}[1m])) by (stage)
dualop_operation | error stage | sum(rate(dualop_operation{stage!="", error!=""}[1m])) by (stage, error)
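
To make the difference between the two write stages concrete, here is a minimal Python sketch of the dual-write pattern described above. It is not TSB's implementation: the DualWriter class, its methods, and the in-memory stores are hypothetical names chosen for illustration. The primary write runs synchronously and lets failures propagate to the caller, while the secondary write runs in the background and only records its failures.

```python
from concurrent.futures import ThreadPoolExecutor


class DualWriter:
    """Hypothetical dual-write helper: sync primary, background secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary          # callable(key, value) for the primary model
        self.secondary = secondary      # callable(key, value) for the secondary model
        self.secondary_errors = []      # failures left for a later reconcile pass
        self._pool = ThreadPoolExecutor(max_workers=1)

    def write(self, key, value):
        # Primary stage: executed synchronously; an exception reaches the caller,
        # which corresponds to an API error.
        self.primary(key, value)
        # Secondary stage: executed in the background; failures are tolerated.
        return self._pool.submit(self._write_secondary, key, value)

    def _write_secondary(self, key, value):
        try:
            self.secondary(key, value)
        except Exception as exc:
            self.secondary_errors.append((key, repr(exc)))

    def close(self):
        # Wait for outstanding background writes before shutting down.
        self._pool.shutdown(wait=True)


primary_store = {}


def failing_secondary(key, value):
    raise RuntimeError("secondary store unavailable")


writer = DualWriter(primary_store.__setitem__, failing_secondary)
writer.write("workspace/a", {"owner": "team-1"})  # succeeds despite the secondary failure
writer.close()
```

After this run the primary store holds the record and one secondary failure has been recorded, which is the situation a reconcile process would later repair.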

Dual-Write Operations Latency

The request latency for operations persisting data to the Q Graph or Persistent Data Store via the dual-write framework. Dual-writes ensure zero-downtime data model migrations.

  • Primary writes are always executed synchronously, and any failure in a primary write will also manifest as an API error.
  • Secondary writes are done in the background and do not manifest as direct API errors. Failures are allowed here, and the data reconcile process will fix any inconsistencies between the primary and secondary models.

Metric Name | Labels | PromQL Expression
dualop_operation_duration_bucket | stage | histogram_quantile(0.99, sum(rate(dualop_operation_duration_bucket{stage!=""}[1m])) by (le, stage))
dualop_operation_duration_bucket | stage | histogram_quantile(0.95, sum(rate(dualop_operation_duration_bucket{stage!=""}[1m])) by (le, stage))
dualop_operation_duration_bucket | stage | histogram_quantile(0.90, sum(rate(dualop_operation_duration_bucket{stage!=""}[1m])) by (le, stage))
dualop_operation_duration_bucket | stage | histogram_quantile(0.75, sum(rate(dualop_operation_duration_bucket{stage!=""}[1m])) by (le, stage))

PDP Success Rate

Successful request rate of PDP grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. The other components of NGAC use this graph to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the NGAC graph's policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision.

Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent an "access granted" decision; it represents an access decision request for which a verdict was obtained.

A drop in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, which is usually a consequence of failures in the PIP. Failures in the PIP for write operations will result in the graph not being properly updated to the latest status, resulting in access decisions based on stale models.

Metric Name | Labels | PromQL Expression
ngac_pdp_operation | error | sum(rate(ngac_pdp_operation{error=""}[1m])) by (method)

PDP Error Rate

Rate of errors for PDP requests grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. The other components of NGAC use this graph to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the NGAC graph's policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision.

Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent an "access granted" decision; it represents an access decision request for which a verdict was obtained.

Failed requests to the PDP show the number of requests from the PEP to the PDP that have failed. They do not represent "access denied" decisions; they represent the access decision requests for which a verdict could not be obtained.

A rise in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, which is usually a consequence of failures in the PIP. Failures in the PIP for write operations will result in the graph not being correctly updated to the latest status, resulting in access decisions based on stale models.

Metric Name | Labels | PromQL Expression
ngac_pdp_operation | error | sum(rate(ngac_pdp_operation{error!=""}[1m])) by (method)

PDP Latency

PDP latency percentiles grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. The other components of NGAC use this graph to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the NGAC graph's policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision.

Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent an "access granted" decision; it represents an access decision request for which a verdict was obtained.

This metric shows the time it takes to get an access decision for authorization requests. Degradation in PDP operations may result in general degradation of the system. Because access decisions are made and enforced for every operation, PDP latency directly impacts user experience.

Metric Name | Labels | PromQL Expression
ngac_pdp_operation_duration_bucket | N/A | histogram_quantile(0.99, sum(rate(ngac_pdp_operation_duration_bucket[1m])) by (method, le))
ngac_pdp_operation_duration_bucket | N/A | histogram_quantile(0.95, sum(rate(ngac_pdp_operation_duration_bucket[1m])) by (method, le))

PIP Success Rate

Successful request rate of PIP grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

PIP operations are executed against the NGAC graph to represent and maintain the objects in the system and the relationships between them.

A drop in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, or that it is failing to persist data. Read failures may result in failed access decisions (in the PDP) and user interaction with the system may be rejected as well. Failures in write operations will result in the graph not being properly updated to the latest status and that could result in access decisions based on stale models.

Metric Name | Labels | PromQL Expression
ngac_pip_operation | error | sum(rate(ngac_pip_operation{error=""}[1m])) by (method)
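
The relationship between the three NGAC components, and the difference between a denied decision and a failed request, can be sketched with a toy model. The Python code below is a deliberately simplified illustration, not TSB's NGAC implementation: the PIP, PDP and enforce names, the graph layout, and the sample users and objects are all hypothetical.

```python
class PIP:
    """Toy Policy Information Point: maintains nodes, assignments and associations."""

    def __init__(self):
        self.parents = {}        # node -> set of nodes it is assigned to
        self.associations = []   # (user_attribute, operation, object_attribute)

    def assign(self, child, parent):
        self.parents.setdefault(child, set()).add(parent)
        self.parents.setdefault(parent, set())

    def associate(self, user_attr, operation, object_attr):
        self.associations.append((user_attr, operation, object_attr))

    def ancestors(self, node):
        if node not in self.parents:
            raise KeyError(f"node not found: {node}")
        seen, stack = {node}, [node]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


class PDP:
    """Toy Policy Decision Point: answers Check and List using the PIP graph."""

    def __init__(self, pip):
        self.pip = pip

    def check(self, user, operation, obj):
        users, objs = self.pip.ancestors(user), self.pip.ancestors(obj)
        return any(ua in users and op == operation and oa in objs
                   for ua, op, oa in self.pip.associations)

    def list(self, user, operation, candidates):
        return [o for o in candidates if self.check(user, operation, o)]


def enforce(pdp, user, operation, obj):
    """Toy Policy Enforcement Point: asks the PDP and rejects denied requests."""
    if not pdp.check(user, operation, obj):
        raise PermissionError(f"{user} may not {operation} {obj}")


pip = PIP()
pip.assign("alice", "admins")
pip.assign("workspace-a", "tenant-1")
pip.assign("workspace-b", "tenant-2")
pip.associate("admins", "write", "tenant-1")
pdp = PDP(pip)

allowed = pdp.check("alice", "write", "workspace-a")   # verdict obtained: allow
denied = pdp.check("alice", "write", "workspace-b")    # verdict obtained: deny
visible = pdp.list("alice", "write", ["workspace-a", "workspace-b"])
```

In this sketch both the allow and the deny verdict would count as successful PDP requests, because a verdict was obtained in each case. A check for a user that is missing from the graph raises an error instead, which corresponds to a request where no verdict could be obtained.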

PIP Latency

PIP latency percentiles grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

This metric shows the time it takes for a PIP operation to complete and, in the case of write operations, to have data persisted in the NGAC graph.

Degradation in PIP operations may result in general degradation of the system. PIP latency represents the time it takes to access the NGAC graph, and this directly affects the PDP when running access decisions. A degraded PIP may result in a degraded PDP, and that will impact user experience, as access decisions are made and enforced for every operation.

Metric Name | Labels | PromQL Expression
ngac_pip_operation_duration_bucket | N/A | histogram_quantile(0.99, sum(rate(ngac_pip_operation_duration_bucket[1m])) by (method, le))
ngac_pip_operation_duration_bucket | N/A | histogram_quantile(0.95, sum(rate(ngac_pip_operation_duration_bucket[1m])) by (method, le))

PIP Error Rate

Rate of errors for PIP requests grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

PIP operations are executed against the NGAC graph to represent and maintain the objects in the system and the relationships between them.

Note: "Node not found" errors are explicitly excluded because, as an optimization, TSB often uses the GetNode method instead of Exists to determine whether a node exists.

A general rise in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, or that it is failing to persist data. Read failures may result in failed access decisions (in the PDP), and user interaction with the system may be rejected as well. Failures in write operations will result in the graph not being properly updated to the latest status, which could result in access decisions based on stale models.

Metric Name | Labels | PromQL Expression
ngac_pip_operation | error | sum(rate(ngac_pip_operation{error!="", error!="Node not found"}[1m])) by (method)

Active PIP Transactions

The number of running transactions on the NGAC PIP.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. The other components of NGAC use this graph to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the NGAC graph's policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision.

Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent an "access granted" decision; it represents an access decision request for which a verdict was obtained.

This metric shows the number of active write operations against the NGAC graph. It can be useful to understand the load of the system generated by concurrent access to the platform.

Metric Name | Labels | PromQL Expression
ngac_pip_concurrent_transaction | N/A | sum(ngac_pip_concurrent_transaction)

XCP Central Operational Status

Operational metrics to indicate XCP Central health.

XCP Central Version

Metric Name | Labels | PromQL Expression
xcp_central_istio_build | N/A | label_replace(xcp_central_istio_build, "version", "$1", "tag", "(.*)")
xcp_central_version | N/A | label_replace(xcp_central_version, "xcp_version", "$1", "version", "(.*)")

Time since last config propagation by Edge (seconds)

Time since the last config propagation, by Edge and status (sent/received).

Metric Name | Labels | PromQL Expression
xcp_central_last_config_propagation_event_timestamp_ms | edge | time() - min(xcp_central_last_config_propagation_event_timestamp_ms{edge!=""} / 1000) by (edge, status)
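
The expression above converts a millisecond timestamp to seconds and subtracts it from the current time, using the oldest timestamp in each group. A small Python sketch of the same arithmetic follows. The function names and the sample timestamps are made up for illustration.

```python
import time


def seconds_since(timestamp_ms, now_s=None):
    """Age in seconds of a millisecond epoch timestamp."""
    now_s = time.time() if now_s is None else now_s
    return now_s - timestamp_ms / 1000.0


def stalest_age(timestamps_ms, now_s=None):
    """Age of the oldest timestamp in a group, mirroring time() - min(... / 1000)."""
    return seconds_since(min(timestamps_ms), now_s)


# Made-up values: two series for one edge, evaluated at a fixed "now".
now = 1_700_000_060.0
series_ms = [1_700_000_000_000, 1_700_000_030_000]
age = stalest_age(series_ms, now_s=now)
```

Here the oldest event is 60 seconds old, so the panel would show 60 for that group even though a newer event arrived 30 seconds ago.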

Config Propagation Event Rate by Edge

Number of config propagation events by edge cluster

Metric Name | Labels | PromQL Expression
xcp_central_config_propagation_event_count | N/A | sum(rate(xcp_central_config_propagation_event_count[1m])) by (edge, status)

Config Updates Event Rate

Number of config updates triggered and the reason: event and group-version-kind (GVK in k8s API terminology).

Metric Name | Labels | PromQL Expression
xcp_central_config_update_push_count | N/A | sum(rate(xcp_central_config_update_push_count[1m])) by (event, kind)

Config Update Error Rate

Rate of errors encountered while processing config updates.

Metric Name | Labels | PromQL Expression
xcp_central_config_update_error_count | N/A | sum(rate(xcp_central_config_update_error_count[1m])) OR on() vector(0)

Config Propagation Latency by Edge

Distribution of time to propagate updates from Central (Management plane) to Edges

Metric Name | Labels | PromQL Expression
xcp_central_config_propagation_time_ms_bucket | N/A | histogram_quantile(0.99, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge))
xcp_central_config_propagation_time_ms_bucket | N/A | histogram_quantile(0.95, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge))
xcp_central_config_propagation_time_ms_bucket | N/A | histogram_quantile(0.90, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge))
xcp_central_config_propagation_time_ms_bucket | N/A | histogram_quantile(0.75, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge))
xcp_central_config_propagation_time_ms_bucket | N/A | histogram_quantile(0.50, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge))

Number of Edge Connections

Number of Edges connected to this Central

Metric Name | Labels | PromQL Expression
xcp_central_current_edge_connections | N/A | sum(xcp_central_current_edge_connections) OR on() vector(0)

Time since last cert rotation

Time (in seconds) since last XCP mTLS certificate rotation.

Metric Name | Labels | PromQL Expression
xcp_central_last_cert_rotation_timestamp_ms | N/A | time() - (min(xcp_central_last_cert_rotation_timestamp_ms) by (cert) / 1000)

Rate of webhook validation passed

Rate of webhook validation passed by GVK

Metric Name | Labels | PromQL Expression
xcp_central_validation_webhook_passed_count | N/A | sum(rate(xcp_central_validation_webhook_passed_count[5m])) by (group, resource)

Rate of webhook validation errors

Rate of webhook validation errors by GVK

Metric Name | Labels | PromQL Expression
xcp_central_validation_webhook_failed_count | N/A | sum(rate(xcp_central_validation_webhook_failed_count[5m])) by (group, resource) OR on() vector(0)
xcp_central_validation_webhook_http_error_count | N/A | sum(rate(xcp_central_validation_webhook_http_error_count[5m])) OR on() vector(0)

Zipkin Operational status

Operational metrics to indicate Tetrate Service Bridge Zipkin stack health.

Requests per second

Rate of HTTP requests to Zipkin by method, URL and response code.

Metric Name | Labels | PromQL Expression
http_server_requests_seconds_count | component plane | sum by(method, uri, status) (rate(http_server_requests_seconds_count{component="zipkin", plane="management"}[1m]))

Requests latency

Latency of HTTP requests to Zipkin.

Metric Name | Labels | PromQL Expression
http_server_requests_seconds_bucket | component plane | histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{component="zipkin", plane="management"}[1m])) by (le))
http_server_requests_seconds_bucket | component plane | histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket{component="zipkin", plane="management"}[1m])) by (le))
http_server_requests_seconds_bucket | component plane | histogram_quantile(0.75, sum(rate(http_server_requests_seconds_bucket{component="zipkin", plane="management"}[1m])) by (le))
http_server_requests_seconds_bucket | component plane | histogram_quantile(0.50, sum(rate(http_server_requests_seconds_bucket{component="zipkin", plane="management"}[1m])) by (le))

Dropped messages/spans

The rate of messages and spans dropped by Zipkin. Note: a span could be dropped if it's a duplicate.

Metric Name | Labels | PromQL Expression
zipkin_collector_messages_dropped_total | plane | sum(rate(zipkin_collector_messages_dropped_total{plane="management"}[5m]))
zipkin_collector_spans_dropped_total | plane | sum(rate(zipkin_collector_spans_dropped_total{plane="management"}[5m]))

Elasticsearch requests

The rate of Zipkin requests to Elasticsearch backend, by method and result.

Metric Name | Labels | PromQL Expression
elasticsearch_requests_total | component plane | sum by (method, result) (rate(elasticsearch_requests_total{component="zipkin", plane="management"}[1m]))

Zipkin Collector Throughput

Cumulative spans and messages read by Zipkin collector; relates to messages reported by instrumented apps

Metric Name | Labels | PromQL Expression
zipkin_collector_message_spans | plane | sum (zipkin_collector_message_spans{plane="management"})
zipkin_collector_spans_total | plane | sum (rate(zipkin_collector_spans_total{plane="management"}[5m]))

Zipkin Bytes in Message

Last size of a message received by Zipkin Collector.

Metric Name | Labels | PromQL Expression
zipkin_collector_message_bytes | plane | sum(zipkin_collector_message_bytes{plane="management"})

Zipkin bytes/sec

Cumulative rate of data received by Zipkin; should relate to messages reported by instrumented apps.

Metric Name | Labels | PromQL Expression
zipkin_collector_bytes_total | plane | sum(rate(zipkin_collector_bytes_total{plane="management"}[5m]))

Zipkin Spans in Message

Last count of spans in a message received by Zipkin Collector.

Metric Name | Labels | PromQL Expression
zipkin_collector_message_spans | plane | sum(zipkin_collector_message_spans{plane="management"})

Threads

The number of threads in Zipkin by status.

Metric Name | Labels | PromQL Expression
jvm_threads_daemon_threads | component plane | sum(jvm_threads_daemon_threads{component="zipkin", plane="management"})
jvm_threads_live_threads | component plane | sum(jvm_threads_live_threads{component="zipkin", plane="management"})
jvm_threads_peak_threads | component plane | sum(jvm_threads_peak_threads{component="zipkin", plane="management"})
jvm_threads_states_threads | component plane | jvm_threads_states_threads{component="zipkin", plane="management"}

Garbage Collection

Max GC Pause on Zipkin by cause.

Metric Name | Labels | PromQL Expression
jvm_gc_pause_seconds_max | component plane | sum by (cause) (jvm_gc_pause_seconds_max{component="zipkin", plane="management"})

JVM Classes

The number of classes that are currently loaded in the Zipkin JVM.

Metric Name | Labels | PromQL Expression
jvm_classes_loaded_classes | component plane | sum (jvm_classes_loaded_classes{component="zipkin", plane="management"})
jvm_classes_unloaded_classes_total | component plane | sum (jvm_classes_unloaded_classes_total{component="zipkin", plane="management"})

JVM Memory

JVM Memory stats for Zipkin instance.

Metric Name | Labels | PromQL Expression
jvm_buffer_total_capacity_bytes | component plane | sum by (id, instance) (jvm_buffer_total_capacity_bytes{component="zipkin", plane="management"})
jvm_memory_max_bytes | component plane | sum by (area, instance) (jvm_memory_max_bytes{component="zipkin", plane="management"})