Monitoring kube-controller-manager is important, as it is a core component of the Kubernetes control plane. Kube-controller-manager runs on the master nodes and takes care of the different controller processes. These controllers watch the status of the different services deployed through the API and take corrective action when the real and desired status don't match.
Kube-controller-manager takes care of nodes, workloads (replication controllers), namespaces (namespace controller) and service accounts (serviceaccount controller), among other things.
Getting metrics from kube-controller-manager
Kube-controller-manager is instrumented and exposes Prometheus metrics by default, providing information about workqueues and requests to the API server. This endpoint can be easily scraped, obtaining all of this information without any extra calculation.
We can test the endpoint by running curl from a pod with network access on a master node:
curl http://localhost:10252/metrics
It will return a long list of metrics with this structure (truncated):
# HELP ClusterRoleAggregator_adds (Deprecated) Total number of adds handled by workqueue: ClusterRoleAggregator
# TYPE ClusterRoleAggregator_adds counter
ClusterRoleAggregator_adds 602
# HELP ClusterRoleAggregator_depth (Deprecated) Current depth of workqueue: ClusterRoleAggregator
# TYPE ClusterRoleAggregator_depth gauge
ClusterRoleAggregator_depth 0
# HELP ClusterRoleAggregator_longest_running_processor_microseconds (Deprecated) How many microseconds has the longest running processor for ClusterRoleAggregator been running.
# TYPE ClusterRoleAggregator_longest_running_processor_microseconds gauge
ClusterRoleAggregator_longest_running_processor_microseconds 0
# HELP ClusterRoleAggregator_queue_latency (Deprecated) How long an item stays in workqueueClusterRoleAggregator before being requested.
# TYPE ClusterRoleAggregator_queue_latency summary
ClusterRoleAggregator_queue_latency{quantile="0.5"} 0
ClusterRoleAggregator_queue_latency{quantile="0.9"} 0
...
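Each non-comment line in this output is a sample in the Prometheus text exposition format: a metric name, optional labels between braces, and a value. As a rough sketch of how such lines can be broken down (the parsing code and regex here are illustrative, not a full implementation of the format):

```python
import re

# One exposition sample line, e.g.:
#   ClusterRoleAggregator_queue_latency{quantile="0.5"} 0
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)$')

def parse_samples(text):
    """Return (metric_name, labels, value) tuples for each sample line,
    skipping the # HELP and # TYPE comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples
```

Running it over the truncated output above would yield tuples like `('ClusterRoleAggregator_adds', {}, 602.0)`; Prometheus itself does this parsing for you when it scrapes the endpoint.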
If we want to configure Prometheus to scrape this endpoint, we can add this job to our targets:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name
In addition, we need to add annotations to the pod, so we have to modify the manifest on the master node, located at /etc/kubernetes/manifests/kube-controller-manager.manifest, and add these lines under annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "10252"
What should I look at when monitoring kube-controller-manager?
Disclaimer: kube-controller-manager metrics might differ between Kubernetes versions. Here, we used Kubernetes 1.15. You can check the metrics available for your version in the Kubernetes repo (link for the 1.15.3 version).
Number of kube-controller-manager instances: This value gives an idea of the general health of kube-controller-manager in the cluster. The expected value is the number of master nodes in the cluster. You can obtain this value by counting the targets found by Prometheus, or by checking the process if you have low-level access to the node.
A possible PromQL query for a single stat graph would be:
sum(up{k8s_app="kube-controller-manager"})
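If you consume the Prometheus HTTP API directly rather than a dashboard, the same count can be derived from the standard `/api/v1/query` JSON response. A minimal sketch (the instance addresses and timestamps in the canned response are made up for illustration):

```python
import json

def count_up_instances(api_response_json):
    """Count targets reporting up == 1 in a Prometheus /api/v1/query
    response for an instant vector such as:
    up{k8s_app="kube-controller-manager"}"""
    data = json.loads(api_response_json)
    return sum(
        1
        for series in data["data"]["result"]
        if float(series["value"][1]) == 1.0  # value is [timestamp, "1"]
    )

# Illustrative response: one healthy and one unhealthy instance.
response = '''{
  "status": "success",
  "data": {"resultType": "vector", "result": [
    {"metric": {"instance": "10.0.0.1:10252"}, "value": [1569340000, "1"]},
    {"metric": {"instance": "10.0.0.2:10252"}, "value": [1569340000, "0"]}
  ]}
}'''
```

With this response, `count_up_instances(response)` returns 1, which would not match the expected number of master nodes and should trigger an investigation.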
Workqueue information: These metrics provide information about the workqueue, helping you detect possible bottlenecks or issues processing different commands. We will focus on metrics aggregated across all of the controllers, but separate metrics are also available for the queues of individual controllers, like the AWS controller, node controller or service account controller.
Workqueue latency: The time that kube-controller-manager is taking to fulfill the different actions that keep the desired status of the cluster. A good way to represent this is with quantiles:
histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{k8s_app="kube-controller-manager"}[5m])) by (instance, name, le))
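histogram_quantile() estimates the target quantile from the cumulative bucket counts of a Prometheus histogram, interpolating linearly inside the bucket where the quantile falls. A rough Python sketch of that calculation (the bucket bounds and counts below are made up for illustration):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets:
    a list of (upper_bound, cumulative_count) pairs sorted by bound,
    the last bound being float('inf'). Interpolates linearly within
    the target bucket, roughly as PromQL's histogram_quantile does."""
    total = buckets[-1][1]
    if total == 0:
        return float('nan')
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                # PromQL falls back to the highest finite bucket bound
                return prev_bound
            width = count - prev_count
            if width == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / width
        prev_bound, prev_count = bound, count
    return float('nan')

# Example: 50 items took <= 0.1s, 90 took <= 0.5s, all 100 took <= 1s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float('inf'), 100)]
```

This is why the PromQL query above sums the `_bucket` series by `le` (the bucket bound label) before applying histogram_quantile.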
Workqueue rate: The number of required actions per unit of time. A high value could indicate problems in the cluster or in some of the nodes.
sum(rate(workqueue_adds_total{k8s_app="kube-controller-manager"}[5m])) by (instance, name)
Workqueue depth: The number of actions waiting in the queue to be performed. It should remain a low value. Note that depth is a gauge, so it is represented directly, without rate():
sum(workqueue_depth{k8s_app="kube-controller-manager"}) by (instance, name)
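A single depth spike is usually harmless; the concerning pattern is depth staying high across consecutive scrapes. A small sketch of that alerting logic (the threshold and streak length are illustrative, not recommended defaults):

```python
def depth_stuck(depth_samples, threshold=10, min_consecutive=3):
    """Return True if workqueue depth stayed above `threshold` for at
    least `min_consecutive` consecutive scrapes."""
    streak = 0
    for depth in depth_samples:
        streak = streak + 1 if depth > threshold else 0
        if streak >= min_consecutive:
            return True
    return False
```

In practice you would express the same idea as a Prometheus alerting rule with a `for:` duration rather than implementing it yourself.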
Information about requests to the API server: These metrics provide information about the requests performed to the API server, so you can check that connectivity is fine and that the API server is providing the information needed to perform controller operations.
Latency:
histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{k8s_app="kube-controller-manager"}[5m])) by (url, le))
Request rate and errors:
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"2.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"3.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"4.."}[5m]))
sum(rate(rest_client_requests_total{k8s_app="kube-controller-manager",code=~"5.."}[5m]))
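rate() estimates the per-second increase of a counter over the window, so these queries give per-second request rates by status code class. The error ratio they support can be sketched from two counter snapshots (the counts and the 300-second interval below are made up for illustration):

```python
def counter_rate(first, last, interval_seconds):
    """Per-second rate of a monotonically increasing counter between
    two scrapes, ignoring counter resets for simplicity."""
    return (last - first) / interval_seconds

# rest_client_requests_total snapshots 300s apart, by status code class.
first = {"2..": 1200, "5..": 0}
last = {"2..": 1500, "5..": 6}
rates = {code: counter_rate(first[code], last[code], 300) for code in first}

# Fraction of requests that failed with a 5xx over the window.
error_ratio = rates["5.."] / sum(rates.values())
```

Here the 2xx rate comes out to 1 request/s and the 5xx rate to 0.02 requests/s; sustained growth of the 4xx or 5xx ratio is what you would alert on.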
Saturation metrics (requires node_exporter):
CPU usage:
rate(process_cpu_seconds_total{k8s_app="kube-controller-manager"}[5m])
Memory usage:
process_resident_memory_bytes{k8s_app="kube-controller-manager"}
Examples of issues in kube-controller-manager
Workloads desired and current status mismatch
This can be caused by many different issues, but as kube-controller-manager is the main component responsible for reconciling current and desired status, it is a likely origin of the problem. Check that the kube-controller-manager instance is up, and that API request latency and the workqueue metrics are at normal values.
Kubernetes seems to be slow performing operations.
Check the latency and depth of the workqueue in kube-controller-manager. It may be having issues performing actions against the API.
kube-controller-manager metrics in Sysdig Monitor
In order to get kube-controller-manager monitoring in Sysdig Monitor, you have to add some sections to the agent YAML configuration file.
# Enable prometheus metrics
metrics_filter:
  # beginning of kube-controller-manager metrics
  - include: "workqueue_adds_total"
  - include: "workqueue_depth"
  - include: "workqueue_queue_duration_seconds*"
  - include: "rest_client_requests_total"
  - include: "rest_client_request_latency_seconds*"
  - include: "go_goroutines"
  # end of kube-controller-manager metrics
prometheus:
  enabled: true
  histograms: true
  max_metrics: 3000
  max_metrics_per_process: 3000
  process_filter:
    - include:
        kubernetes.pod.label.k8s-app: kube-controller-manager
        port: 10252
        conf:
          tags:
            kubernetes.component.name: kube-controller-manager
          host: 127.0.0.1
          port: 10252
          use_https: false
With the metrics_filter part, you ensure that these metrics won't be discarded due to the custom metrics limit. If you are interested in a particular metric offered by kube-controller-manager that is not in this list, you can add it.
In the second part, you set how the Sysdig agent will scrape the metrics: searching for the Kubernetes pods that have the label k8s-app: kube-controller-manager and scraping localhost through port 10252. Since the Sysdig agent is capable of switching network context and connecting to the pod as if it were localhost, we don't need to use the node IP.
You can then build custom dashboards using these metrics. We have some pre-built dashboards that we can share with you if you are interested.
Conclusion
Monitoring kube-controller-manager is very important, as it is a key component of the Kubernetes control plane. Remember that kube-controller-manager is responsible for maintaining the correct number of elements in all of the Deployments, DaemonSets, persistent volume claims and many other Kubernetes elements.
An issue in kube-controller-manager can compromise the scalability and resilience of the applications running in the cluster. Monitoring kube-controller-manager allows you to avoid complications that would otherwise be hard to detect. So, don't forget to monitor your control plane!
Sysdig helps you follow Kubernetes monitoring best practices, which is just as important as monitoring your workloads and applications running inside the cluster. Request a demo today!