How to monitor Golden signals in Kubernetes

What are Golden signals metrics? How do you monitor golden signals in Kubernetes applications? Golden Signals can help you detect issues in a microservices application. These signals are a reduced set of metrics that offer a wide view of a service from a user or consumer perspective, so you can detect potential problems that might be directly affecting the behavior of the application.

Golden signals, a standard for Kubernetes application monitoring

Congratulations, you have successfully deployed your application in Kubernetes. This is the moment you discover your old monitoring tools are pretty much useless and that you’re unable to detect potential problems. Classic monitoring tools are usually based on static configuration files and were designed to monitor machines, not microservices or containers. In the container world, things change fast. Containers are created and destroyed at an incredible pace and it’s impossible to keep up without specific service discovery functions. According to the latest Sysdig Container Usage Report, 22% of containers live for less than 10 seconds and 54% live for less than five minutes.

Most modern monitoring systems offer a huge variety of metrics for many different purposes. It’s quite easy to drown in metrics and lose focus on what is really relevant for your application. Setting too many irrelevant alerts can drive you into a permanent state of emergency and “alert burnout.” Imagine a node that is heavily used and raising load alerts all of the time, yet you don’t act on them as long as the services on the node work. Having too many alerts is as bad as not having any, because important alerts get masked in a sea of irrelevance.

This is a problem that many people have faced and, fortunately, someone has already solved. The answer is the four Golden Signals, a term first used in the Google SRE book. Golden Signals are four metrics that will give you a very good idea of the real health and performance of your application as seen by the actors interacting with that service, whether they are end users or another service in your microservices application.

Picture from Denise Yu (@deniseyu21).

Golden signals metric: Latency explained

Latency is the time your system takes to serve a request against the service. This is an important signal for detecting performance degradation.

When measuring latency, it’s not enough to use average values, as they can be misleading. For example, suppose a service shows an average response time of 100ms. With only this information we might consider it pretty good, but users report that the service feels slow.

The answer to this contradiction can be found using other statistical parameters, like the standard deviation, which gives us an idea of the dispersion of the latency values. What if we have two kinds of requests: one very fast, and another slower because it’s more database intensive? If a typical user interaction has one slow request and 10 fast ones, the mean will probably be pretty low, but the application will feel slow. Bottleneck analysis is important too, not only mean values.

A great tool to avoid this problem is histogram metrics. These indicate the number of requests under different latency thresholds and allow you to aggregate them into percentiles. A percentile is a value below which a given percentage of measures falls. For example, p99 says that 99% of my requests have a latency lower than the percentile value.

As you can see in the screenshot, average latency is acceptable, but if we look at percentiles we see a lot of dispersion in the values, which gives a better idea of what the real latency perception is. Different percentiles express different information: p50 usually reflects general performance degradation, while p95 – or p99 – allows detection of performance issues in specific requests or components of the system.
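
As a rough sketch, percentiles can be computed in Prometheus from histogram buckets with histogram_quantile(). The queries below assume the greeting_seconds histogram used in the practical example later in this article; the metric name and time window are placeholders you would adapt to your own instrumentation:

histogram_quantile(0.50, sum(rate(greeting_seconds_bucket[5m])) by (le))  # p50
histogram_quantile(0.99, sum(rate(greeting_seconds_bucket[5m])) by (le))  # p99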

You may think that high latency in 1% of requests is not a big issue, but now think of a web application that needs several requests to be fully loaded and displayed. In this common scenario, high latency in 1% of the requests can affect a much higher percentage of final users, because one of those multiple requests is enough to slow down the whole application.

Another useful tool for analyzing latency is the Apdex score, which, given your SLA terms, can provide a general idea of how good your system condition is based on percentiles.
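
To illustrate the idea, an Apdex-like score can be approximated in PromQL from histogram buckets. The sketch below assumes a target latency of 0.3s and a tolerable latency of 1.2s, and that both values exist as bucket boundaries in the greeting_seconds histogram used later in this article; adjust them to your own SLA and bucket definition:

(
  sum(rate(greeting_seconds_bucket{le="0.3"}[5m]))
  + sum(rate(greeting_seconds_bucket{le="1.2"}[5m]))
) / 2 / sum(rate(greeting_seconds_count[5m]))  # (satisfied + tolerating/2) over total requests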

Golden signals metric: Errors explained

The rate of errors returned by your service is a very good indicator of deeper issues. It’s very important to detect not only explicit errors, but implicit errors as well.

An explicit error would be any kind of HTTP error code. These are pretty easy to identify, as the status code is easily obtained from the response and the codes are pretty consistent across many systems. Some examples of these errors could be authorization errors (401, 403), content not found (404) or server errors (500, 503). Error descriptions can be very specific in some cases (418 – I’m a teapot).

On the other hand, implicit errors can be trickier to detect. How about a request with HTTP response code 200 but with an error message in the content? Different policy violations should be considered errors too:

  • Errors that do not generate an HTTP reply, such as a request that took longer than the timeout.
  • Content errors in an apparently successful request.

When using dashboards to analyze errors, mean values or percentiles don’t make much sense. In order to properly see the impact of errors, the best way is to use rates. The rate of requests ending in error – or the percentage of failing requests – gives detailed information about when the system started to fail and with what impact.
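
As a sketch, the error percentage can be derived from the greeting_seconds_count counter and its code label, both taken from the practical example later in this article; swap in your own metric and label names:

sum(rate(greeting_seconds_count{code!="200"}[2m])) / sum(rate(greeting_seconds_count[2m])) * 100  # % of requests ending in error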

Golden signals metric: Traffic / connections explained

Traffic or connections is an indicator of the amount of use of your service per time unit. It can be measured in many different ways depending on the nature of the system, like the number of requests to an API or the bandwidth consumed by a streaming app.

It can be useful to group traffic indicators by different parameters, like response code, or by dimensions related to business logic.
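
For instance, with the request counter and code label assumed from the practical example later in this article, the request rate can be broken down by response code with a query like this:

sum(rate(greeting_seconds_count[2m])) by (code)  # requests per second, per response code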

Golden signals metric: Saturation explained

This metric should be the answer to a question: how full is my system?

Usually, saturation is expressed as a percentage of the maximum capacity, but each system will have different ways to measure saturation. The percentage could mean the number of users or requests obtained directly from the application or based upon estimations.

Most times, saturation is derived from system metrics, like CPU or memory, so it doesn’t rely on instrumentation and is collected directly from the system using different methods, like the Prometheus node-exporter. Obtaining system metrics from a Kubernetes node is essentially the same as with any other system. At the end of the day, they are Linux machines.

It’s important to choose adequate metrics and use as few as possible. The key to successfully measuring saturation is to choose the metrics that constrain the performance of the system. If your application is processor intensive, use CPU load. If it’s memory intensive, use memory usage. The process of choosing saturation metrics is often a good exercise to detect bottlenecks in the application.

You should set saturation alerts with some margin, because performance usually drops drastically once saturation exceeds 80%.
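
A minimal sketch of such an alert expression, using the node-exporter CPU query from the practical example below as the saturation metric and an assumed 80% threshold, could look like this:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80  # CPU usage above 80% per node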

Golden signals vs RED method vs USE method in Kubernetes

There are several approaches to designing an efficient monitoring system for an application, but they are commonly based on the four Golden Signals. Some of them, like the RED method, give more importance to service-level metrics, like request rate, errors and latency. Others, like the USE method, focus on system-level metrics and low-level values like CPU usage, memory and I/O. When should we use each approach?

RED method

The RED method is focused on parameters of the application, without considering the infrastructure that runs it. It’s an external view of the service – how the clients see the service. Golden Signals try to add the infrastructure component by including saturation, which necessarily has to be derived from system metrics. This way we get a deeper view, as every service is unavoidably tied to the infrastructure running it. Maybe an external view is fine, but saturation will give you an idea of “how far” the service is from a failure.

USE method

The USE method puts the accent on the utilization of resources, including errors in the requests as the only external indicator of problems. This method could miss issues that affect some parts of the service. What if the database is slow due to bad query optimization? That would increase latency but would not be noticeable in saturation. Golden Signals try to get the best of both methods, including both externally observable and system parameters.

Having said this, all of these methods have a common goal: they try to homogenize and simplify complex systems in order to make incident detection easier. If you’re capable of detecting any issue with a small set of metrics, the process of scaling your monitoring to a large number of systems will be almost trivial.

Simplify monitoring, a good side effect

As a good side effect, reducing the number of metrics involved in incident detection helps to reduce alert fatigue caused by arbitrary alerts set on metrics that don’t necessarily indicate a real issue or don’t have a clear, direct action path.

As a weakness, any simplification will remove details in the information received. It’s important to note that, despite Golden Signals being a good way to detect ongoing or future problems, once the problem is detected, the investigation process will require the use of different inputs to be able to dig deeper into the root cause of the problem. Any tool at hand can be useful for the troubleshooting process, like logs, custom metrics or different metric aggregation – for example, separate latency per deployment.

Golden Signal metrics instrumentation in Kubernetes

Instrumenting code with Prometheus metrics / custom metrics

In order to get Golden Signals with Prometheus, code changes (instrumentation) will be required. This topic is quite vast and has been covered in many previous articles like Prometheus metrics / OpenMetrics code instrumentation.

Prometheus has become a de facto standard for metrics collection, so most languages have a client library to implement custom metrics in your application in a convenient way. Nevertheless, instrumenting custom metrics requires a deep understanding of what the application does.

Poorly implemented code instrumentation can end up in a time series cardinality explosion, with a real chance of collapsing your metrics collection system. Using a request ID as a label, for example, generates one time series per request (seen in a real use case). Obviously, this is something you don’t want in your monitoring system, as it increases the resources needed to collect the information and can potentially cause downtime. Choosing a correct aggregation can be key to a successful monitoring approach.
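
As a quick sanity check – a sketch that assumes the greeting_seconds metrics from the practical example below – you can count how many time series a metric generates; a sudden jump usually points to a label with runaway cardinality:

count(greeting_seconds_count)  # number of active series behind a single metric
count by (__name__) ({__name__=~"greeting_seconds.*"})  # series per metric, across related metrics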

Sysdig eBPF system call visibility (no instrumentation)

Sysdig Monitor uses eBPF to get information about all the system calls directly from the kernel. This way, your application doesn’t need any modification, neither in the code nor in the container runtime. What’s running in your nodes is exactly the container you built with the exact version of the libraries, with your code (or binaries) intact.

System calls can give information about the processes running, memory allocation, network connections, access to the filesystem and resource usage, among other things. With this information, it’s possible to obtain meaningful metrics that will provide a lot of information about what is happening in your systems.

Golden Signals are some of the metrics available out-of-the-box, providing latency, request rate, errors and saturation, with the added value that all of these metrics are correlated with the information collected from the Kubernetes API. This correlation allows you to do meaningful aggregations and represent the information along multiple dimensions:

  • Group latency by node: This will provide information about different problems with your Kubernetes infrastructure.
  • Group latency by deployment: This allows you to track problems in different microservices or applications.
  • Group latency by pod: Maybe a pod in your deployment is unhealthy.

These different levels of aggregation allow us to slice our data and locate issues, helping with troubleshooting tasks by digging into the different levels of the Kubernetes entities, from cluster to node, to deployment and then to pod.
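
If you are collecting a latency histogram with Prometheus instead, similar slicing is possible as long as your scrape configuration attaches Kubernetes labels to the series. The query below is a sketch that assumes a pod label added by relabeling rules (the actual label names depend entirely on your setup) and the greeting_seconds histogram from the example later in this article:

histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (le, pod))  # p95 latency per pod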

Instrumenting code with APM / OpenTracing

Different APM (Application Performance Monitoring) applications can give very specific information about your application, including locating the code responsible for a specific action. This requires instrumentation, either with code changes or with adjustments to your application container.

This method requires the monitoring agent to load libraries into your application, either explicitly in the code or implicitly (by binary preloading or modified runtimes). This means that what is running in production might not be the exact code you wrote in development, implying a risk of unforeseen problems and uncontrolled software updates. You are even exposed to a crash in your application due to a crash in the instrumentation code. Performance degradation can be an issue too, as APM requires extra work to retrieve all the data.

In addition, you’re running third-party code in the instrumentation. Does your security team audit the code of the APM library?

OpenTracing can be a good alternative to commercial APM, as it provides a vendor-agnostic instrumentation method. It can be used with many different open source and commercial solutions, and it has a good community that takes care of the reliability and security of the libraries. One more thing: it’s under the CNCF umbrella.

The relation between APM and Golden Signals is somewhat complex, because some of the parameters are related to the infrastructure – like saturation – and this is often the weakest part of an APM approach.

You can find more information about this topic here: How to instrument code: Custom metrics vs APM vs OpenTracing.

Istio

Istio is a service mesh: a layer over the applications deployed in Kubernetes that provides different features to manage networking functions, like canary deployments, intelligent routing, circuit breakers, load balancing, network policy enforcement or health checks.

One of the features that Istio provides is visibility into the services, along with a limited tracing feature. It gives information about latency, errors and requests, making it a very good approach to easily obtain the Golden Signals. You can learn more about getting Istio metrics in our blog: How to monitor Istio.

A practical example of Golden signals in Kubernetes

As an example to illustrate the use of Golden Signals, we have deployed a simple Go example application with Prometheus instrumentation. This application applies a random delay between 0 and 12 seconds in order to produce usable latency data. Traffic is generated with curl running in several infinite loops.

We have included a histogram to collect metrics related to latency and requests. These metrics will help us obtain the first three Golden Signals: latency, request rate and error rate. We will obtain saturation directly with Prometheus and node-exporter, using – in this example – the CPU percentage of the nodes.

We have deployed the application in a Kubernetes cluster with Prometheus and Grafana, and generated a dashboard with the Golden Signals. In order to obtain the data for the dashboards, we have used these PromQL queries:

Latency:

sum(greeting_seconds_sum)/sum(greeting_seconds_count)  # Average
histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (le))  # Percentile p95

Request rate:

sum(rate(greeting_seconds_count{}[2m]))  # Including errors
sum(rate(greeting_seconds_count{code="200"}[2m]))  # Only 200 OK requests

Errors per second:

sum(rate(greeting_seconds_count{code!="200"}[2m]))

Saturation:

We have used the CPU percentage obtained with node-exporter:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Result dashboard:

This way we obtain this dashboard with the Golden signals:

This cluster also has the Sysdig agent installed. Sysdig allows us to obtain these same Golden Signals without the use of instrumentation (although Sysdig could pull in Prometheus metrics too!). With Sysdig, we could use a default dashboard and we would obtain the same meaningful information out-of-the-box!

Depending on the nature of the application, it’s possible to do different aggregations:

  • Response time segmented by response code.
  • Error rate segmented by response code.
  • CPU usage per service or deployment.
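
With the Prometheus metrics from this example, the first two aggregations in the list above could be approximated with queries like the following (a sketch that relies on the code label of the greeting_seconds histogram):

sum(rate(greeting_seconds_sum[5m])) by (code) / sum(rate(greeting_seconds_count[5m])) by (code)  # average response time per response code
sum(rate(greeting_seconds_count{code!="200"}[5m])) by (code)  # error rate per response code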

Caveats and gotchas of Golden signals in Kubernetes

Golden signals are one of the best ways to detect possible problems, but once the problem is detected you will have to use additional metrics and steps to further diagnose the problem. Detecting issues and resolving them are two different tasks and they require separate tools and views of the application.

The mean is not always meaningful; check the standard deviation too, especially with latency. Take into consideration the request path of your application to look for bottlenecks. You should use percentiles instead of averages (or in addition to them).

Does it make sense to alert every time the CPU or load is high? Probably not. Avoid “alert burnout” by setting alerts only on parameters that clearly indicate problems. If an alert is not actionable, just remove it.

If a parameter doesn’t look good but isn’t directly affecting your application, don’t set an alert. Instead, create tasks in your backlog to analyze the behavior and prevent possible issues in the long term.

This article was originally published on June 11, 2019, and it’s been updated since.

