Metric
A metric is a measurement about a service, captured at runtime. Logically, the moment of capturing one of these measurements is known as a metric event, which consists not only of the measurement itself but also of the time it was captured and the associated metadata.
The key aspects of a metric are the measure, the metric type, the metric origin, and the metric detect point:
- The measure describes the type and unit of a metric event, also known as a measurement.
- The metric type is the aggregation over time applied to the measurements.
- The metric origin tells where the metric measurements come from.
- The detect point is the point from which the metric is observed: in service, server side, or client side. It is useful for differentiating between metrics that observe a concrete service (often self-observing) and metrics that focus on service-to-service communication.
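As a rough illustration of these concepts, the sketch below models a metric event and the key aspects of a metric as plain Python data classes. The class and field names (MetricEvent, MetricDescriptor, and so on) are hypothetical and chosen only for this example; they are not part of the TSB API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricEvent:
    """One captured measurement: the value, when it was captured, and metadata."""
    value: float
    captured_at: datetime
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class MetricDescriptor:
    """The key aspects of a metric (hypothetical names, for illustration only)."""
    measure_type: str   # what is measured, e.g. "LATENCY"
    measure_unit: str   # unit of each measurement, e.g. "ms"
    metric_type: str    # aggregation applied over time, e.g. "AVERAGE"
    origin: str         # where the measurements come from
    detect_point: str   # where the metric is observed, e.g. "SERVER_SIDE"


# Illustrative values only.
descriptor = MetricDescriptor("LATENCY", "ms", "AVERAGE", "MESH_OBSERVED", "SERVER_SIDE")
event = MetricEvent(
    value=12.5,
    captured_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    metadata={"service": "reviews.bookinfo"},
)
```

The descriptor says how measurements are interpreted and aggregated, while each event carries a single measurement together with its capture time and metadata.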
A TSB-controlled service (one that is part of the mesh and has a proxy we can configure) has several metrics available, which enables consistent monitoring across services. Some of them cover what is known as the RED metrics set, a set of very useful metrics for HTTP/RPC request-based services. RED stands for:
- Rate (R): The number of requests per second.
- Errors (E): The number of failed requests.
- Duration (D): The amount of time to process a request.
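To make the three RED signals concrete, here is a minimal sketch that derives them from a handful of request records. The Request class, the red_summary function, and the sample data are all assumptions made for this example; TSB derives these metrics from proxy telemetry, not from code like this.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to process the request
    failed: bool        # whether the request ended in an error


def red_summary(requests, window_seconds):
    """Return (rate per second, failed request count, average duration in ms)."""
    total = len(requests)
    rate = total / window_seconds
    errors = sum(1 for r in requests if r.failed)
    duration = sum(r.duration_ms for r in requests) / total if total else 0.0
    return rate, errors, duration


# Hypothetical sample: four requests observed in a 2-second window.
sample = [
    Request(10.0, False),
    Request(20.0, False),
    Request(30.0, True),
    Request(40.0, False),
]
rate, errors, duration = red_summary(sample, window_seconds=2)
```

For this sample the rate is 2 requests per second, there is 1 failed request, and the average duration is 25 ms.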
To better understand which metrics are available for a given telemetry source, assume we have deployed the classic Istio bookinfo demo application. Let's look at some RED-based metrics available for a service observed and managed by TSB, for instance the reviews service using the GLOBAL scoped telemetry source.
The following metric is the number of requests per minute that the reviews service is handling at a GLOBAL scope:
apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_cpm
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: REQUESTS
    unit: "{request}"
  metricType:
    type: CPM
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
The following metric is the average duration of the requests handled by the reviews service at a GLOBAL scope:
apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_resp_time
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: LATENCY
    unit: ms
  metricType:
    type: AVERAGE
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
The following metric covers the errors of the requests handled by the reviews service at a GLOBAL scope. In this case the number of errors is expressed as a percentage of the total number of handled requests:
apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_sla
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: STATUS
    unit: NUMBER
  metricType:
    type: PERCENT
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
Using a different telemetry source for the same metric gives a different view of the same observed measurements. For instance, to know how many requests per minute subset v1 of the reviews service is handling, we need to use the same metric but from a different telemetry source, in this case reviews-v1:
apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews-v1
  name: service_cpm
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: REQUESTS
    unit: NUMBER
  metricType:
    type: CPM
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
The duration or latency measurements can also be aggregated into different percentiles over time. The following metric is the duration percentiles of the requests handled by the reviews service at a GLOBAL scope:
apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_percentile
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: LATENCY
    unit: ms
  metricType:
    type: PERCENTILE
    labels:
      - key: "0"
        value: "p50"
      - key: "1"
        value: "p75"
      - key: "2"
        value: "p90"
      - key: "3"
        value: "p95"
      - key: "4"
        value: "p99"
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
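The sketch below shows one way a percentile aggregation could be computed from raw latency measurements, using the nearest-rank method. This is only an assumption for illustration: the documentation does not state which percentile algorithm TSB uses, and the function name and sample values are hypothetical.

```python
import math


def nearest_rank_percentile(samples, rank):
    """Return the nearest-rank percentile (0 < rank <= 100) of the samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least rank% of samples at or below it.
    index = math.ceil(rank * len(ordered) / 100) - 1
    return ordered[index]


# Hypothetical latency measurements in ms, collected over one time window.
latencies_ms = [12, 15, 11, 40, 18, 22, 95, 14, 16, 13]
percentiles = {
    f"p{rank}": nearest_rank_percentile(latencies_ms, rank)
    for rank in (50, 90, 99)
}
```

Each entry of the resulting dictionary corresponds to one label value of the metric type, so a single time window yields one aggregated value per percentile label.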
Measure
A measure represents the name and unit of a measurement. Request latency in ms and the number of errors are examples of measures to collect from a server. In the first case, latency is the type and ms (milliseconds) is the unit.
Field | Type | Validation Rule |
name | string | – |
unit | string | – |
MeshControlledMeasureNames
The names of the measures available for a controlled service in the mesh.
Name | Number | Description |
INVALID_MEASURE_TYPE | 0 | |
COUNTABLE | 1 | Represents discrete instances of a countable quantity. An integer count of something SHOULD use the default unit, the unity. Countable is a generalized measure name that can be used for many common countable quantities. Because the name is generalized, use annotations with curly braces to give additional meaning. Network packets and system paging faults are examples of countable measures. |
REQUESTS | 2 | Requests is a specialized countable measure that represents the number of requests. |
LATENCY | 3 | The time taken by each request. |
STATUS | 4 | The success or failure of a request. |
HTTP_RESPONSE_CODE | 5 | The response code of the HTTP response, if this request is an HTTP call, e.g. 200, 404, 302. |
RPC_RESPONSE_CODE | 6 | The value of the RPC response code. |
SIDECAR_INTERNAL_ERROR_CODE | 7 | The sidecar/gateway proxy internal error code. The value is based on the implementation. |
SIDECAR_RETRY_EXCEEDED | 8 | The sidecar/gateway proxy internal error code. The value is based on the implementation. |
TCP_INFO_RECEIVED_BYTES | 9 | The received bytes of the TCP traffic, if this request is a TCP call. |
TCP_INFO_SEND_BYTES | 10 | The sent bytes of the TCP traffic, if this request is a TCP call. |
MTLS_IN_USE | 11 | Whether mutual TLS is in use in the connections between services. |
SIDECAR_HEAP_MEMORY_USED | 12 | Current reserved heap size in bytes. New Envoy process heap size on hot restart. |
SIDECAR_MEMORY_ALLOCATED | 14 | Current amount of allocated memory in bytes. Total of both new and old Envoy processes on hot restart. |
SIDECAR_PHYSICAL_MEMORY | 15 | Current estimate of total bytes of the physical memory. New Envoy process physical memory size on hot restart. |
SIDECAR_TOTAL_CONNECTIONS | 16 | Total connections of both new and old Envoy processes. |
SIDECAR_PARENT_CONNECTIONS | 17 | Total connections of the old Envoy process on hot restart. |
SIDECAR_WORKER_THREADS | 18 | Number of worker threads. |
SIDECAR_BUG_FAILURES | 19 | Number of Envoy bug failures detected in a release build. File or report the issue if this increments, as this may be serious. |
Metric
A metric is a measurement about a service, captured at runtime. Logically, the moment of capturing one of these measurements is known as a metric event which consists not only of the measurement itself, but the time that it was captured and associated metadata.
Application and request metrics are important indicators of availability and performance. Custom metrics can provide insights into how availability indicators impact user experience or the business. Collected data can be used to alert of an outage or trigger scheduling decisions to scale up a deployment automatically upon high demand.
Field | Type | Validation Rule |
observedResource | string | – |
measure | tetrateio.api.tsb.observability.telemetry.v2.Measure | – |
type | tetrateio.api.tsb.observability.telemetry.v2.MetricType | – |
origin | tetrateio.api.tsb.observability.telemetry.v2.MetricOrigin | – |
detectionPoint | tetrateio.api.tsb.observability.telemetry.v2.MetricDetectionPoint | – |
MetricDetectionPoint
The detection point from which the metric is observed.
Name | Number | Description |
INVALID_METRIC_DETECTION_POINT | 0 | |
IN_SERVICE | 1 | Self-observability metrics use the in-service detection point. |
CLIENT_SIDE | 2 | Client side is how the client is observing the metric. When service A calls service B, service A acts as the client side. |
SERVER_SIDE | 3 | Server side is how the server is observing the metric. When service A calls service B, service B acts as the server side. |
MetricOrigin
Where the metric measurements come from.
Name | Number | Description |
INVALID_METRIC_ORIGIN | 0 | |
MESH_CONTROLLED | 1 | The metrics origin is from a TSB configured mesh, capturing the metrics from the sidecar's available observability. |
AGENT_OBSERVED | 2 | An agent, which can be standalone or a service with automatic instrumentation via bytecode injection. Currently not available. Part of hybrid observability. |
MESH_IMPORTED | 3 | Other known mesh generated metrics that are not configured and handled by TSB. Currently not available. Part of hybrid observability. |
EXTERNAL_IMPORTED | 4 | External captured metrics that are either imported into TSB observability stack or queried at runtime. Currently not available. Part of hybrid observability. |
MetricType
Metric types are the aggregation functions applied to the measurements that took place over a period of time. Some metric types, like LABELED_COUNTER and PERCENTILE, are additionally aggregated over the set of defined labels.
Field | Type | Validation Rule |
name | tetrateio.api.tsb.observability.telemetry.v2.MetricType.Type | – |
labels | List of tetrateio.api.tsb.observability.telemetry.v2.MetricType.Label | – |
Label
Label of a metric type. Labels can also be seen as additional dimensions of aggregation, besides the time interval over which measurements are aggregated.
Field | Type | Validation Rule |
key | string | – |
value | string | – |
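Labels add a grouping dimension on top of the time window. A minimal sketch of that idea, counting measurements per label value in the spirit of LABELED_COUNTER, might look as follows. The function name and the sample response codes are hypothetical.

```python
from collections import Counter


def labeled_count(measurements, label_of):
    """Count the measurements in one time window, grouped by label value."""
    return Counter(label_of(m) for m in measurements)


# Hypothetical HTTP response codes observed during one time window.
response_codes = [200, 200, 404, 200, 503, 404]
counts = labeled_count(response_codes, label_of=str)
```

Here the label is the response code itself, so the window produces one count per distinct code instead of a single total.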
Type
Name | Number | Description |
INVALID_METRIC_TYPE | 0 | |
GAUGE | 1 | The last measurement seen over a period of time. |
COUNTER | 2 | The sum of the number of measurements over a period of time. Used in number-of-requests style metrics. |
AVERAGE | 3 | Average function applied to the measurements. Used in Duration/latency style of metrics. |
PERCENT | 4 | Percentage function applied to a given observed value over the total observed values. Used in SLA style metrics, for example the percentage of errored responses over the total server responses. |
APDEX | 5 | Application Performance Index monitors end-user satisfaction. Apdex score |
HEATMAPS | 6 | Heat maps are a three dimensional visualization, using x and y coordinates for two dimensions, and color intensity for the third. They can reveal detail that summary statistics, such as line charts of averages, can miss. Latency measurements can be aggregated using Heatmaps/histograms. One dimension is often time, the other is the latency, and the third one (the intensity) is the frequency of that latency in the given time range. |
LABELED_COUNTER | 7 | The sum of the number of measurements over time, grouped by concrete label values. Used, for instance, for counting responses by their HTTP response code. |
PERCENTILE | 8 | This is a specific subtype of LABELED_COUNTER. Used in duration/latency style metrics. |
CPM | 10 | Calls per minute. Used in requests per minute, 5xx HTTP errors per minute, and 4xx HTTP errors per minute, among other metrics. |
MAX | 11 | Selects the highest measurement over a period of time. Envoy max allocated style metrics. |
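Several of the aggregation functions in the table can be sketched over a single time window as follows. This is an illustrative reading of the descriptions above, not TSB's implementation; the function names, the way the window length is passed in, and the sample values are assumptions.

```python
def gauge(values):
    """GAUGE: the last measurement seen in the window."""
    return values[-1]


def average(values):
    """AVERAGE: the arithmetic mean of the measurements in the window."""
    return sum(values) / len(values)


def maximum(values):
    """MAX: the highest measurement in the window."""
    return max(values)


def percent(matching, total):
    """PERCENT: share of matching observed values over all observed values."""
    return 100.0 * matching / total if total else 0.0


def cpm(call_count, window_seconds):
    """CPM: calls per minute, normalized from a window of any length."""
    return call_count * 60 / window_seconds


# Hypothetical latency measurements in ms, in the order they were captured.
latencies_ms = [120.0, 80.0, 100.0]

last_seen = gauge(latencies_ms)
mean_latency = average(latencies_ms)
peak_latency = maximum(latencies_ms)
error_percent = percent(matching=1, total=4)      # 1 errored response out of 4
calls_per_minute = cpm(call_count=30, window_seconds=120)
```

All of these functions reduce the measurements of one window to a single value; label-based types such as LABELED_COUNTER and PERCENTILE instead produce one value per label.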