TSB Alerting Guidelines
Note: Tetrate Service Bridge collects a large number of metrics, and the relationships between them differ from environment to environment. This document outlines generic alerting guidelines rather than an exhaustive list of alert configurations and thresholds, since those will vary with each environment's workload patterns.
Overall, the alert configuration should follow several principles:
- Every alert must be urgent and actionable. Alerts that do not require an immediate response should be notifications or tasks/tickets instead.
- The number of alerts should be kept to a minimum to avoid alert fatigue for your oncall.
- Avoid redundant alerts.
- Alert on symptoms rather than causes, where applicable.
- Every alert must have an up-to-date playbook/runbook that serves as the source of truth for impact, troubleshooting scenarios and documentation.
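As a sketch of these principles, a Prometheus alerting rule can carry severity and runbook metadata directly in its labels and annotations. The rule name, threshold, duration and runbook URL below are placeholders, not TSB defaults; the expression is the TSB availability query described later in this document:

```yaml
groups:
  - name: tsb-example
    rules:
      - alert: TSBAvailabilityLow            # placeholder name
        expr: |
          sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m]))
            /
          sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[1m]))
            < 0.99
        for: 5m                              # require the condition to persist before paging
        labels:
          severity: page                     # urgent and actionable
        annotations:
          summary: "TSB API success rate below 99%"
          runbook_url: "https://runbooks.example.com/tsb/availability"  # hypothetical URL
```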
TSB Operational Status
TSB Availability
The rate of successful requests to TSB API. This is an extremely user-visible signal and should be treated as such.
The THRESHOLD value should be established from historical metrics data used as a baseline. A sensible value for a first iteration would be 0.99.
Example PromQL expression:
sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) BY (grpc_method) / sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) BY (grpc_method) < THRESHOLD
TSB Request Latency
TSB GRPC API request latency metrics are intentionally not emitted due to high metric cardinality. Tetrate is in the process of gathering feedback on the necessity and usefulness of API GRPC latency to be added back in future releases.
TSB Request Traffic
The raw rate of requests to TSB API. The monitoring value comes mostly from detecting outliers and unexpected behaviour, e.g. an unexpectedly high or low request rate. To establish correct thresholds, it is important to have the history of metrics data to gauge the baseline. Example PromQL expression:
sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) BY (grpc_method) < THRESHOLD (or > THRESHOLD for an upper bound)
TSB Absent Metrics
TSB talks to its persistent backend even without constant external load. Absence of these requests reliably indicates an issue with TSB metrics collection and should be treated as a high-priority incident as the lack of metrics means the loss of visibility into TSB status.
Note: a common cause of this issue is a deadlock in the OpenTelemetry collector. If this alert fires, one of the first steps should be to check the otel-collector pod status and restart it if needed. Tetrate is currently working with upstream maintainers to address this bug.
Example PromQL expression:
sum(rate(persistence_operation[10m])) == 0
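Note that a sum(rate(...)) == 0 comparison returns no result, rather than firing, if the series disappears from the time-series database entirely. A common complementary pattern, sketched here as a suggestion rather than a TSB recommendation, is to also alert on outright metric absence:

```promql
# Fires (returns 1) when no persistence_operation samples have been
# ingested in the last 10 minutes, covering the case where the metric
# vanishes and the rate-based expression returns no result at all.
absent_over_time(persistence_operation[10m]) == 1
```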
Persistent Backend Availability
Persistent backend availability as observed from TSB, with no insight into internal Postgres operations.
TSB stores all of its state in the persistent backend and as such, its operational status (availability, latency, throughput etc) is extremely tightly coupled with the status of persistent backend. TSB records the metrics for persistent backend operations that may be used as a signal to alert on.
It is important to note that any degradation in persistent backend operations will inevitably lead to overall TSB degradation, be it in availability, latency or throughput. This means that alerting on persistent backend status may be redundant: the oncall person will receive two pages instead of one whenever there is a problem with Postgres that requires attention. However, such a signal still has significant value in providing context that decreases the time to triage the issue and address the root cause or escalate. In this case, alerting on the cause rather than the symptom is a trade-off between having technically redundant alerts and reducing the time to triage.
Note on the treatment of "resource not found" errors: some level of "not found" responses is normal because TSB, for optimisation purposes, often uses Get queries instead of Exists to determine whether a resource exists. However, a large rate of "not found" (404-like) responses likely indicates an issue with the persistent backend setup.
Example PromQL expressions:
- Queries:
1 - ( sum(rate(persistence_operation{error!="", error!="resource not found"}[1m])) / sum(rate(persistence_operation[1m])) OR on() vector(0) ) < THRESHOLD
- Too many "resource not found" queries:
( sum(rate(persistence_operation{error="resource not found"}[1m])) OR on() vector(0) ) / sum(rate(persistence_operation[1m])) > THRESHOLD (e.g. 0.50)
- Transactions:
sum(rate(persistence_transaction{error=""}[1m])) / sum(rate(persistence_transaction[1m])) < THRESHOLD
Persistent Backend Latency
The latency of persistent backend operations as recorded by the persistent backend client (TSB). This latency effectively translates to user-seen latency and as such is a vital signal.
The THRESHOLD value should be established from historical metrics data used as a baseline. A sensible value for a first iteration would be a 300ms 99th percentile latency.
Example PromQL expressions:
- Queries
histogram_quantile(0.99, sum(rate(persistence_operation_duration_bucket[1m])) by (le, method)) > THRESHOLD
- Transactions:
histogram_quantile(0.99, sum(rate(persistence_transaction_duration_bucket[1m])) by (le)) > THRESHOLD
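Since histogram_quantile over high-cardinality buckets can be expensive to evaluate on every alert check, one common pattern (a sketch, not a TSB requirement; the rule and alert names are placeholders) is to precompute the quantile with a recording rule and alert on the recorded series:

```yaml
groups:
  - name: tsb-persistence-latency
    rules:
      # Precompute the p99 query latency once per evaluation interval.
      - record: persistence:operation_duration:p99          # placeholder name
        expr: |
          histogram_quantile(0.99,
            sum(rate(persistence_operation_duration_bucket[1m])) by (le, method))
      # Alerts can then reference the cheap recorded series.
      - alert: PersistenceQueryLatencyHigh                  # placeholder name
        expr: persistence:operation_duration:p99 > 0.3      # assumes buckets are in seconds
        for: 5m
```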
TSBD Operational Status
Last Management Plane Sync
The max time elapsed since tsbd last synced with the management plane for each registered cluster. This indicates how stale the configuration received from the management plane is in a given cluster. A reasonable first-iteration threshold here is 30 seconds.
Example PromQL expression:
time() - max(tsbd_tsb_latest_sync_time{cluster_name="$cluster"}) > THRESHOLD
TSBD Saturation
TSB Control Plane components are mostly CPU-constrained, so CPU utilisation serves as an important signal and should be alerted on. When choosing alert THRESHOLDs, keep in mind that cloud providers tend to overprovision CPU, and hyperthreading can reduce Linux scheduler efficiency, leading to increased latencies/errors even below ~80% CPU utilisation.
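No expression is prescribed here since saturation metrics depend on your monitoring stack. One possible sketch, assuming cAdvisor and kube-state-metrics are scraped (the namespace and pod name pattern are placeholders for your actual deployment):

```promql
# Fraction of the CPU limit consumed by tsbd pods over the last 5 minutes.
# container_cpu_usage_seconds_total comes from cAdvisor;
# kube_pod_container_resource_limits comes from kube-state-metrics.
sum(rate(container_cpu_usage_seconds_total{namespace="istio-system", pod=~"tsbd-.*"}[5m]))
  /
sum(kube_pod_container_resource_limits{namespace="istio-system", pod=~"tsbd-.*", resource="cpu"})
  > 0.8
```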
Istio Operational Status
NB: this is not an exhaustive list of valuable signals that the Istio Data Plane provides. For more in-depth information, please refer to:
- https://istio.io/latest/docs/examples/microservices-istio/logs-istio/
- https://istio.io/latest/docs/ops/best-practices/observability/
- https://istio.io/latest/docs/concepts/observability/
This document describes the absolute bare minimum alerting setup for Istio service mesh.
Proxy Convergence Time
Delay in seconds between config change and a proxy receiving all required configuration. This is another part of configuration propagation latency.
Example PromQL expression:
histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le)) > THRESHOLD
Istiod Error Rate
The error rate of various Istiod operations. To establish correct thresholds, it is important to have the history of metrics data to gauge the baseline.
Example PromQL queries:
- Write Timeouts:
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) > THRESHOLD
- Internal Errors:
sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) > THRESHOLD
- Config Rejections:
sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) > THRESHOLD
Configuration Validation
The success rate of Istio configuration validation requests. Elevated errors indicate that the Istio configuration generated and propagated by tsbd is invalid; this should be addressed urgently.
Example PromQL expression:
sum(rate(galley_validation_passed{cluster_name="$cluster"}[1m])) / ( sum(rate(galley_validation_passed{cluster_name="$cluster"}[1m])) + sum(rate(galley_validation_failed{cluster_name="$cluster"}[1m])) ) < THRESHOLD
Capacity Planning and Resource Saturation
TSB, tsbd, OAP/Zipkin Saturation
TSB components are mostly CPU-constrained; in addition, OAP/Zipkin memory utilisation matters, depending on the amount of telemetry/traces they collect. Thus, CPU utilisation serves as an important signal and should be alerted on. Even though it is not a direct symptom of an issue affecting users, saturation provides a valuable signal that the system is underprovisioned/oversaturated before there is negative user impact.
Keep in mind when choosing alert THRESHOLDs that cloud providers tend to overprovision CPU, and hyperthreading can reduce Linux scheduler efficiency, leading to increased latencies/errors even below ~80% CPU utilisation.
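For the OAP/Zipkin memory side, one possible sketch, again assuming cAdvisor and kube-state-metrics are available; the namespace and pod name pattern are placeholders to adjust for your deployment:

```promql
# Fraction of the memory limit consumed by OAP pods.
# container_memory_working_set_bytes is the value the kernel OOM-killer
# acts on, making it a better saturation signal than raw RSS.
sum(container_memory_working_set_bytes{namespace="istio-system", pod=~"oap-.*"})
  /
sum(kube_pod_container_resource_limits{namespace="istio-system", pod=~"oap-.*", resource="memory"})
  > 0.8
```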