How to monitor kube-controller-manager

Monitoring kube-controller-manager is important, as it is a main component of the Kubernetes control plane. Kube-controller-manager runs on the master nodes and takes care of the different controller processes. These controllers watch the status of the different services deployed through the API and take corrective actions when the real and desired statuses don't match.

Kube-controller-manager takes care of nodes, workloads (replication controllers), namespaces (namespace controller) and service accounts (serviceaccount controller), among other things.

Kubernetes controller manager is a process in the master nodes

Getting metrics from kube-controller-manager

Kube-controller-manager is instrumented and exposes Prometheus metrics by default, providing information about workqueues and requests to the API. This endpoint can be easily scraped, obtaining all of this information without any extra calculation.

We can test the endpoint by running curl from a pod with network access to the master nodes:

curl http://localhost:10252/metrics

It will return a long list of metrics with this structure (truncated):

# HELP ClusterRoleAggregator_adds (Deprecated) Total number of adds handled by workqueue: ClusterRoleAggregator
# TYPE ClusterRoleAggregator_adds counter
ClusterRoleAggregator_adds 602
# HELP ClusterRoleAggregator_depth (Deprecated) Current depth of workqueue: ClusterRoleAggregator
# TYPE ClusterRoleAggregator_depth gauge
ClusterRoleAggregator_depth 0
# HELP ClusterRoleAggregator_longest_running_processor_microseconds (Deprecated) How many microseconds has the longest running processor for ClusterRoleAggregator been running.
# TYPE ClusterRoleAggregator_longest_running_processor_microseconds gauge
ClusterRoleAggregator_longest_running_processor_microseconds 0
# HELP ClusterRoleAggregator_queue_latency (Deprecated) How long an item stays in workqueueClusterRoleAggregator before being requested.
# TYPE ClusterRoleAggregator_queue_latency summary
ClusterRoleAggregator_queue_latency{quantile="0.5"} 0
ClusterRoleAggregator_queue_latency{quantile="0.9"} 0
...
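
If you are only interested in the aggregated workqueue metrics used later in this post, you can filter the output; this is just a convenience one-liner:

curl -s http://localhost:10252/metrics | grep '^workqueue_'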

If we want to configure Prometheus to scrape this metrics endpoint, we can add this job to our scrape configuration:

- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # only scrape pods annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name

In addition, we need to add annotations to the pod, so we have to modify the manifest on the master node, located at /etc/kubernetes/manifests/kube-controller-manager.manifest, and add these under annotations:

  prometheus.io/scrape: "true"
  prometheus.io/port: "10252"
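
After the change, the metadata section of the manifest would look roughly like this (a minimal sketch; the exact name and labels depend on how your cluster was provisioned):

apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "10252"
# the spec section of the manifest stays unchanged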

What should I look at when monitoring kube-controller-manager?

Disclaimer: kube-controller-manager metrics might differ between Kubernetes versions. Here, we used Kubernetes 1.15. You can check the metrics available for your version in the Kubernetes repo (link for the 1.15.3 version).

Number of kube-controller-manager instances: This value gives an idea of the general health of kube-controller-manager in the master nodes. The expected value is the number of master nodes in the cluster. You can obtain this value by counting the targets found by Prometheus, or by checking the process directly if you have low-level access to the node.

A possible PromQL query for a single stat graph would be:

sum(up{k8s_app="kube-controller-manager"})
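
If you want to be alerted when instances go missing, a Prometheus alerting rule could be a starting point (a minimal sketch, assuming the same k8s_app label used above; adjust the expression to your number of master nodes):

groups:
  - name: kube-controller-manager
    rules:
      - alert: KubeControllerManagerDown
        expr: absent(up{k8s_app="kube-controller-manager"} == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No kube-controller-manager instance is up or being scraped"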

Workqueue information: These metrics provide information about the workqueues, helping to detect possible bottlenecks or issues processing the different commands. We will focus on the aggregated metrics from all of the controllers, but there are also metrics available for the queues of the individual controllers, like the AWS controller, node controller or service account controller.

Workqueue latency: It's the time that kube-controller-manager takes to fulfill the different actions that keep the desired status of the cluster. A good way to represent this is with quantiles:

histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{k8s_app="kube-controller-manager"}[5m])) by (instance, name, le))
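
If you prefer a single average value over quantiles, you can divide the accumulated queue time by the number of processed items (a sketch using the _sum and _count series of the same histogram):

sum(rate(workqueue_queue_duration_seconds_sum{k8s_app="kube-controller-manager"}[5m])) by (instance, name) / sum(rate(workqueue_queue_duration_seconds_count{k8s_app="kube-controller-manager"}[5m])) by (instance, name)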

Workqueue rate: It's the number of required actions per unit of time. A high value could indicate problems in the cluster or in some of the nodes.

sum(rate(workqueue_adds_total{k8s_app="kube-controller-manager"}[5m])) by (instance, name)

Workqueue depth: It's the number of actions waiting in the queue to be performed. It should remain at low values.

sum(workqueue_depth{k8s_app="kube-controller-manager"}) by (instance, name)

Information about requests to the API server: These metrics provide information about the requests performed to the api-server, so you can check that connectivity is fine and that the api-server is providing the information needed to perform the controller operations.

Latency:

histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{k8s_app="kube-controller-manager"}[5m])) by (url, le))

Request rate and errors:

sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"2.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"3.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"4.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"5.."}[5m]))
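
To get a single error ratio instead of four separate series, you can divide the server errors by the total requests (a sketch using the same metric and labels):

sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"5.."}[5m])) / sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager"}[5m]))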

Saturation metrics: You can monitor the resource usage of the kube-controller-manager process with the process metrics it exposes itself:

CPU usage:

rate(process_cpu_seconds_total{k8s_app="kube-controller-manager"}[5m])

Memory usage:

process_resident_memory_bytes{k8s_app="kube-controller-manager"}
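
Since kube-controller-manager is a Go process, the number of goroutines can serve as an additional saturation signal (this metric is also included in the Sysdig agent configuration shown below):

go_goroutines{k8s_app="kube-controller-manager"}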

Examples of issues in kube-controller-manager

Workloads desired and current status mismatch

This can be caused by many different issues, but as kube-controller-manager is the main component responsible for reconciling the current and desired status, it is a possible origin of the issue. Check that the kube-controller-manager instance is up, and that the latency of the API requests and the workqueues is within normal values.
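
A quick first check from the command line could be (assuming your control plane pods carry the k8s-app=kube-controller-manager label used in the queries above):

kubectl get pods -n kube-system -l k8s-app=kube-controller-manager -o wide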

Kubernetes seems to be slow performing operations.

Check the latency and depth of the workqueues in kube-controller-manager. It may be having issues performing the actions against the API.

kube-controller-manager metrics in Sysdig Monitor

In order to monitor kube-controller-manager in Sysdig Monitor, you have to add some sections to the agent YAML configuration file.

# Enable Prometheus metrics
metrics_filter:
    # beginning of kube-controller-manager
    - include: "workqueue_adds_total"
    - include: "workqueue_depth"
    - include: "workqueue_queue_duration_seconds*"
    - include: "rest_client_requests_total"
    - include: "rest_client_request_latency_seconds*"
    - include: "go_goroutines"
    # end of kube-controller-manager
prometheus:
    enabled: true
    histograms: true
    max_metrics: 3000
    max_metrics_per_process: 3000
    process_filter:
      - include:
          kubernetes.pod.label.k8s-app: kube-controller-manager
          port: 10252
          conf:
            tags:
              kubernetes.component.name: kube-controller-manager
            host: 127.0.0.1
            port: 10252
            use_https: false

With the metrics_filter part, you ensure that these metrics won't be discarded due to the custom metrics limit. If you are interested in a particular metric offered by kube-controller-manager that is not in this list, you can add it.

In the second part, you set how the Sysdig agent will scrape the metrics: searching for the Kubernetes pods that have the label k8s-app: kube-controller-manager and scraping localhost on port 10252. Since the Sysdig agent is capable of switching network context and connecting to the pod as if it were localhost, we don't need to use the node IP.

You can then build custom dashboards using these metrics. We have some pre-built dashboards that we can share with you if you are interested.

Control plane manager dashboard in Sysdig Monitor

Conclusion

Monitoring kube-controller-manager is very important, as it is a key component of the Kubernetes control plane. Remember that kube-controller-manager is responsible for maintaining the correct number of elements in all of the deployments, daemonsets, persistent volume claims and many other Kubernetes elements.

An issue in kube-controller-manager can compromise the scalability and resilience of the applications running in the cluster. Monitoring kube-controller-manager allows you to detect these complications, which would otherwise be hard to spot. So, don't forget to monitor your control plane!

Sysdig helps you follow Kubernetes monitoring best practices, which is just as important as monitoring your workloads and applications running inside the cluster. Request a demo today!
