1

I got empty values for CPU and memory when I used igztop to check running pods in an Iguazio/MLRun deployment. See the first line of the output, for the pod ending in *m6vd9:

[ jist @ iguazio-system 07:41:43 ]->(0) ~ $ igztop -s cpu
+--------------------------------------------------------------+--------+------------+-----------+---------+-------------+-------------+
| NAME                                                         | CPU(m) | MEMORY(Mi) | NODE      | STATUS  | MLRun Proj. | MLRun Owner |
+--------------------------------------------------------------+--------+------------+-----------+---------+-------------+-------------+
| xxxxxxxxxxxxxxxx7445dfc774-m6vd9                             |        |            | k8s-node3 | Running |             |             |
| xxxxxx-jupyter-55b565cc78-7bjfn                              | 27     | 480        | k8s-node1 | Running |             |             |
| nuclio-xxxxxxxxxxxxxxxxxxxxxxxxxx-756fcb7f74-h6ttk           | 15     | 246        | k8s-node3 | Running |             |             |
| mlrun-db-7bc6bcf796-64nz7                                    | 13     | 717        | k8s-node2 | Running |             |             |
| xxxx-jupyter-c4cccdbd8-slhlx                                 | 10     | 79         | k8s-node1 | Running |             |             |
| v3io-webapi-scj4h                                            | 8      | 1817       | k8s-node2 | Running |             |             |
| v3io-webapi-56g4d                                            | 8      | 1827       | k8s-node1 | Running |             |             |
| spark-worker-8d877878c-ts2t7                                 | 8      | 431        | k8s-node1 | Running |             |             |
| provazio-controller-644f5784bf-htcdk                         | 8      | 34         | k8s-node1 | Running |             |             |

It was also not possible to see performance metrics (CPU, memory, I/O) for this pod in Grafana.

Do you know how I can resolve this issue without restarting the whole node, and what the root cause is?

JIST
  • 1
    Try using kubectl top nodes and kubectl top pods, check whether metrics-server is installed and running on your cluster, and use kubectl describe to confirm whether the pod shows CPU and memory. – Sai Chandra Gadde Jun 22 '23 at 14:28
  • Refer to this [troubleshooting metric-server](https://cloud.ibm.com/docs/containers?topic=containers-debug_metrics_server). – Sai Chandra Gadde Jun 22 '23 at 14:32

2 Answers

1

The following troubleshooting steps should help you resolve the issue:

1. Check whether the pod's CPU and memory settings (requests and limits) are shown by the describe command:

kubectl describe pods my-pod

2. Check whether you can view CPU and memory usage for all pods and nodes with the following commands:

kubectl top pod 

kubectl top node

3. Check whether metrics-server is running with the following commands:

kubectl get apiservices v1beta1.metrics.k8s.io
kubectl get pod -n kube-system -l k8s-app=metrics-server

4. Check the CPU and memory of the pod with the following PromQL queries:

CPU Utilisation Per Pod:

sum(irate(container_cpu_usage_seconds_total{container!="POD", container=~".+"}[2m])) by (pod)

RAM Usage Per Pod:

sum(container_memory_usage_bytes{container!="POD", container=~".+"}) by (pod)

5. Check the logs of the pod and the node. If you find any errors, attach those logs for further troubleshooting.
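
Steps 2 and 3 can be narrowed down to the affected pod with a small comparison. The following is only a minimal sketch: the helper name `pods_without_metrics`, the file names, and the `NAMESPACE` variable are hypothetical and not part of kubectl or igztop. It lists running pods that `kubectl top pod` returns no sample for.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names listed in file $1 (running pods)
# that do not appear in file $2 (pods reported by `kubectl top pod`).
pods_without_metrics() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  grep -Fxv -f "$2" "$1" || true
}

# Usage against a live cluster (NAMESPACE is a placeholder you must set):
# kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Running \
#   --no-headers -o custom-columns=NAME:.metadata.name > running.txt
# kubectl top pod -n "$NAMESPACE" --no-headers | awk '{print $1}' > with_metrics.txt
# pods_without_metrics running.txt with_metrics.txt
```

If only one pod is printed while metrics-server itself is healthy, the problem is more likely on the node that hosts that pod than in metrics-server.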

Sai Chandra Gadde
  • Thx, it seems like a useful way to diagnose this – JIST Jun 23 '23 at 11:58
  • @jist, is xxxxxxxxxxxxxxxx7445dfc774-m6vd9 the actual igztop output, or did you mask it with xxxxxxxxxxxxx? – xsqian Jun 26 '23 at 04:49
  • @xsqian, I only masked the output because it contained project and function names. – JIST Jun 26 '23 at 05:28
  • @xsqian, it is related to the kubelet, ... but the issue goes deeper. – JIST Jun 26 '23 at 05:32
  • @jist since it's related to the specific pod of your function, it's hard to tell by looking at the output from igztop. Would you like to create a support ticket? – xsqian Jun 26 '23 at 15:59
  • @xsqian, I made it (see #3066, #3008, #3075), but the root cause is not clear. – JIST Jun 26 '23 at 17:19
0

It seems to be an issue with the kubelet. The best approach is to follow the step-by-step scenario below (see the diagram in the PDF).

[Images: k8s diagram, first part; k8s diagram, second part]
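
One way to test the kubelet hypothesis is to read the kubelet's cAdvisor metrics on the affected node directly, bypassing metrics-server and Prometheus. This is a sketch under assumptions, not the procedure from the diagrams: it assumes the API server's node proxy is reachable and that your account is allowed to use it, and `cadvisor_lines_for_pod` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: from cAdvisor metrics text on stdin, keep the CPU
# usage samples whose pod label equals the given pod name.
cadvisor_lines_for_pod() {
  grep '^container_cpu_usage_seconds_total{' | grep -F "pod=\"$1\"" || true
}

# Usage against a live cluster (node and pod names as in the igztop output):
# kubectl get --raw "/api/v1/nodes/k8s-node3/proxy/metrics/cadvisor" \
#   | cadvisor_lines_for_pod "xxxxxxxxxxxxxxxx7445dfc774-m6vd9"
```

If this prints nothing for the pod while other pods on the same node do appear, the gap is probably at the kubelet or container runtime level, which would match empty values in both igztop and Grafana.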

XiongChan