One of our Kafka brokers had a very high load average (about 8 on average) in an 8 core machine. Although this should be okay but our cluster still seems to be facing problems and producers were failing to flush messages at the usual pace.
Upon further investigation, I found that my java process was waiting too much for IO, almost 99.99% of the time and as of now, I believe this is a problem.
Mind that this happened even when the load was relatively low (around 100-150 Kbps), I have seen it perform perfectly even with 2 Mbps of data input into the cluster.
I am not sure if this problem is because of Kafka, I am assuming it is not because all other brokers worked fine during this time and our data is perfectly divided among the 5 brokers.
Please assist me in finding the root cause of the problem. Where should I look to find the problem? Are there any other tools that can help me debug this problem?
We are using 1 TB mounted EBS Volume on an m5.2x large machine.
Please feel free to ask any questions.