
Hey, I'm currently trying to determine the uptime of a pod with kube-state-metrics, specifically when a pod has started or stopped. I'm using a Prometheus Deployment together with kube-state-metrics for this, and the two metrics I want to collect are:

kube_pod_completion_time
kube_pod_created
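
Roughly, the plan is to subtract one timestamp from the other to get each pod's lifetime. As a sketch (assuming Prometheus is reachable in-cluster at prometheus.monitoring.svc.cluster.local:9090, which is an assumption and not part of the config below), the query would look something like:

    # Sketch only: ask the Prometheus HTTP API for the lifetime, in seconds, of
    # pods that have a completion timestamp. Adjust the service URL to wherever
    # your Prometheus instance is actually reachable.
    curl -sG 'http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query' \
        --data-urlencode 'query=kube_pod_completion_time - kube_pod_created'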

As a test I've configured Prometheus to gather metrics with the following config.yml file:

        global:
            scrape_interval: 10m
            scrape_timeout: 10s
            evaluation_interval: 10m
        scrape_configs:
            - job_name: kubernetes-nodes-cadvisor
              honor_timestamps: true
              scrape_interval: 10m
              scrape_timeout: 10s
              metrics_path: /metrics
              scheme: https
              authorization:
                  type: Bearer
                  credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: true
              follow_redirects: true
              enable_http2: true
              relabel_configs:
                  - separator: ;
                    regex: __meta_kubernetes_node_label_(.+)
                    replacement: $1
                    action: labelmap
                  - separator: ;
                    regex: (.*)
                    target_label: __address__
                    replacement: kubernetes.default.svc:443
                    action: replace
                  - source_labels: [__meta_kubernetes_node_name]
                    separator: ;
                    regex: (.+)
                    target_label: __metrics_path__
                    replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
                    action: replace
              metric_relabel_configs:
                  - source_labels: [__name__]
                    regex: '(container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_writes_bytes_total|container_memory_max_usage_bytes|container_network_receive_bytes_total|container_network_transmit_bytes_total)'
                    action: keep
              kubernetes_sd_configs:
                  - role: node
                    kubeconfig_file: ''
                    follow_redirects: true
                    enable_http2: true
            - job_name: 'kube-state-metrics'
              scrape_interval: 10m
              static_configs:
                - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
              metric_relabel_configs:
                - source_labels: [__name__]
                  regex: '(kube_pod_labels|kube_pod_created|kube_pod_completion_time|kube_pod_container_resource_limits)'
                  action: keep
        remote_write:
            - url: http://example.com
              remote_timeout: 30s
              follow_redirects: true
              enable_http2: true
              oauth2:
                  token_url: https://example.com
                  client_id: myCoolID
                  client_secret: myCoolPassword
              queue_config:
                  capacity: 2500
                  max_shards: 200
                  min_shards: 1
                  max_samples_per_send: 10
                  batch_send_deadline: 5s
                  min_backoff: 30ms
                  max_backoff: 5s
              metadata_config:
                  send: false
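
To sanity-check that the kube-state-metrics job above actually exposes these series, I believe something along these lines should work (the service name and port are taken from the scrape config; I haven't confirmed the output yet):

    # Port-forward the kube-state-metrics service locally and grep its /metrics output.
    kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
    curl -s http://localhost:8080/metrics | grep -E 'kube_pod_(created|completion_time)'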

Additionally, I have the following test Deployment running:

---
apiVersion: apps/v1
kind: Deployment
metadata:
    name: busy-box-test
spec:
    replicas: 1
    selector:
        matchLabels:
            app: busy-box-test
    template:
        metadata:
            labels:
                app: busy-box-test
        spec:
            containers:
                - command:
                      - sleep
                      - '300'
                  image: busybox
                  name: test-box
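
My expectation is that this pod comes up and goes down roughly every five minutes; to see what it actually does over time, I watch it with something like:

    # Watch the test pod's phase and restarts, which is what the metrics should reflect.
    kubectl get pods -l app=busy-box-test -w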

However, when I search for kube_pod_completion_time in my remote-write destination, I cannot find any samples, while all of the other metrics in the keep regex (kube_pod_labels, kube_pod_created, kube_pod_container_resource_limits) are present.

Additionally, I've tried the following commands to see whether the metrics are present in the cluster: kubectl get --raw '/metrics' | grep kube_ and kubectl get --raw 'kube-state-metrics.kube-system.svc.cluster.local:8080', but I don't find anything definitive. I suspect the commands are looking in the wrong location.
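
If it helps, the form I think the API-server service-proxy request is supposed to take (untested on my side; the service name and port are the ones from my scrape config above) is:

    # Sketch: reach the kube-state-metrics service through the API server's service
    # proxy and look for the completion-time series.
    kubectl get --raw '/api/v1/namespaces/kube-system/services/kube-state-metrics:8080/proxy/metrics' \
        | grep kube_pod_completion_time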

So, beyond anything obvious that I might have missed, I have the following open questions:

Is there an endpoint inside the cluster that I should hit to get the completion time? Is there an issue with the scrape interval being once every 10 minutes for a pod that comes up and goes down every 5 minutes? (If anyone knows how long terminated-pod history sticks around in kube-state-metrics, that would be great to know as well.)

To keep the post a bit more concise, I've put the kube-state-metrics configuration in a gist: https://gist.github.com/twosdai/12607c8459bdb73fc98edbbcb17b5eb5. The cluster is running on AWS EKS, version 1.22.

  • For anyone in the future: I solved this problem by not using these metrics, since they turned out to be unreliable. It was a better decision to create a time series of "statuses", specifically to send the pod's running status via remote write every few minutes and use that as a proxy for uptime. The reason is that if you miss the single terminated or started event, the whole system is thrown off, whereas missing one "up" sample has only a small impact. – Daniel Wasserlauf Jun 29 '23 at 14:07
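
A minimal sketch of the kind of status check that comment describes (assuming kube_pod_status_phase is kept in the kube-state-metrics scrape and that Prometheus is reachable at the same assumed URL as above, neither of which is in the config in the question):

    # Sketch: sample the pod's "Running" status on every scrape instead of relying on
    # one-off created/completed timestamps; missing a single sample then loses only
    # one data point rather than the whole start/stop event.
    curl -sG 'http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query' \
        --data-urlencode 'query=kube_pod_status_phase{phase="Running", pod=~"busy-box-test.*"}'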
