While monitoring the actual memory (RSS) used by containers, I found that the total memory of all containers is larger than the total memory of all physical nodes. This is very strange.

However, I noticed that some of the monitored series have no container_name label. Only when the series without a container_name label are removed does the actual container memory look reasonable.

Why does this happen? (PS: container_name!="POD" is excluded in both queries.)


sum(sum(container_memory_rss{container_name!="POD",container_name=~"[a-z].*"}) by (container_name))/1024^4

sum(sum(container_memory_rss{container_name!="POD"}) by (container_name))/1024^4
Lin

1 Answer

Here is what we use for mapping container memory metrics:

sum by (container, pod, namespace, node, job)(container_memory_rss{container != "POD", image != "", container != ""})

To answer your specific question about why the value is higher: it is because the sum includes the node memory itself.

The kubelet (cAdvisor) reports memory metrics for multiple cgroups. For example, id="/" is the metric for the root cgroup (i.e. for the entire node).

For example, in my setup the following series is the node memory:

{endpoint="https-metrics", id="/", instance="10.0.84.2:10250", job="kubelet", metrics_path="/metrics/cadvisor", node="ip-10-xx-x-x.us-west-2.compute.internal", service="kube-prometheus-stack-kubelet"}
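
To make the double counting concrete, here is a minimal Python sketch. The cgroup ids, label values and byte counts are all made up for illustration, and the real label set on your cluster may differ. It only shows that summing every series adds parent cgroups (node, pod) on top of the containers they already contain, while the selector from the query above keeps just the real containers.

```python
# Hypothetical cAdvisor-style samples: (labels, RSS in bytes).
# All ids and values are invented for illustration.
GIB = 1024 ** 3

samples = [
    ({"id": "/", "container": "", "image": ""}, 6 * GIB),                # root cgroup = whole node
    ({"id": "/kubepods", "container": "", "image": ""}, 4 * GIB),        # all pods together
    ({"id": "/kubepods/pod-a", "container": "", "image": ""}, 4 * GIB),  # pod-level cgroup
    ({"id": "/kubepods/pod-a/c0", "container": "POD", "image": "pause"}, 0),
    ({"id": "/kubepods/pod-a/c1", "container": "app", "image": "app:1"}, 3 * GIB),
    ({"id": "/kubepods/pod-a/c2", "container": "sidecar", "image": "sidecar:1"}, 1 * GIB),
]


def is_real_container(labels):
    """Mirror of {container != "POD", image != "", container != ""}."""
    return labels["container"] not in ("", "POD") and labels["image"] != ""


# Excluding only "POD" still keeps the node-level and pod-level series.
not_pod_total = sum(v for labels, v in samples if labels["container"] != "POD")

# Keeping only series with a non-empty container and image drops the parents.
filtered_total = sum(v for labels, v in samples if is_real_container(labels))

print(not_pod_total / GIB)   # 18.0
print(filtered_total / GIB)  # 4.0
```

In this toy data the node really uses 6 GiB, yet the sum that only excludes "POD" reports 18 GiB because the same memory is counted at the node, pod and container levels.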

Also, at www.asserts.ai we use the max of the rss, working set and usage metrics to arrive at the actual memory used by a container.

See our recording rules below for reference:

      
      #
      - record: asserts:container_memory
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)(container_memory_rss{container != "POD", image != "", container != ""})
        labels:
          source: rss

      - record: asserts:container_memory
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)(container_memory_working_set_bytes{container != "POD", image != "", container != ""})
        labels:
          source: working

      - record: asserts:container_memory
        # Why sum? Multiple copies of the same container may be running in the same pod.
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)
          (
          container_memory_usage_bytes {container != "POD", image != "", container != ""} -
          container_memory_cache {container != "POD", image != "", container != ""}-
          container_memory_swap {container != "POD", image != "", container != ""}
          )
        labels:
          source: usage

      # For KPI Rollup Purposes
      - record: asserts:resource:usage
        expr: |-
          max without (source) (asserts:container_memory)
          * on (namespace, pod, asserts_env, asserts_site) group_left(workload) asserts:mixin_pod_workload
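
As a rough illustration of what the rules above compute, the following Python sketch takes the three per-container values (rss, working set, and usage minus cache minus swap) and keeps the largest, which is the effect of `max without (source)`. The readings and the `(pod, container)` keys are hypothetical, and the final `group_left(workload)` join is left out.

```python
# Hypothetical per-container readings in bytes, keyed by (pod, container).
rss = {("pod-a", "app"): 300, ("pod-a", "sidecar"): 50}
working_set = {("pod-a", "app"): 350, ("pod-a", "sidecar"): 40}
usage = {("pod-a", "app"): 500, ("pod-a", "sidecar"): 60}
cache = {("pod-a", "app"): 180, ("pod-a", "sidecar"): 5}
swap = {("pod-a", "app"): 0, ("pod-a", "sidecar"): 0}


def container_memory(rss, working_set, usage, cache, swap):
    """Return the max over the three 'source' values for each container."""
    result = {}
    for key in rss:
        by_source = {
            "rss": rss[key],
            "working": working_set[key],
            # usage source: usage_bytes - cache - swap, as in the third rule
            "usage": usage[key] - cache[key] - swap[key],
        }
        # Equivalent in spirit to: max without (source) (asserts:container_memory)
        result[key] = max(by_source.values())
    return result


memory = container_memory(rss, working_set, usage, cache, swap)
print(memory)  # {('pod-a', 'app'): 350, ('pod-a', 'sidecar'): 55}
```

Taking the max is a conservative choice: whichever of the three measurements is highest for a container is the one reported.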