
One of our Kafka brokers had a very high load average (about 8 on average) on an 8-core machine. Although that should be acceptable, our cluster still seems to be facing problems and producers were failing to flush messages at the usual pace.

Upon further investigation, I found that my Java process was spending almost 99.99% of its time waiting for IO, and as of now I believe this is a problem.

Note that this happened even when the load was relatively low (around 100-150 Kbps); I have seen the broker perform perfectly well with 2 Mbps of data coming into the cluster.

I am not sure whether this problem is caused by Kafka. I am assuming it is not, because all the other brokers worked fine during this time and our data is evenly divided among the 5 brokers.

Please assist me in finding the root cause of this problem. Where should I look to find it? Are there any other tools that can help me debug this?

We are using a 1 TB EBS volume mounted on an m5.2xlarge machine.
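For reference, this is the kind of check I can run on the broker. A minimal sketch using psutil; the device name is a placeholder for whatever backs the Kafka log directory on your instance:

```python
# Sketch only: sample CPU iowait and per-disk throughput every few seconds
# to see whether the volume is the bottleneck. Requires `pip install psutil`.
import time
import psutil

DEVICE = "nvme1n1"  # placeholder device name; check `lsblk` on the broker

prev = psutil.disk_io_counters(perdisk=True)[DEVICE]
while True:
    time.sleep(5)
    cpu = psutil.cpu_times_percent(interval=None)  # Linux exposes iowait here
    cur = psutil.disk_io_counters(perdisk=True)[DEVICE]
    read_mb = (cur.read_bytes - prev.read_bytes) / 5 / 1024 / 1024
    write_mb = (cur.write_bytes - prev.write_bytes) / 5 / 1024 / 1024
    print(f"iowait={cpu.iowait:5.1f}%  read={read_mb:7.2f} MB/s  write={write_mb:7.2f} MB/s")
    prev = cur
```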

Please feel free to ask any questions.

iotop snapshot (screenshot omitted)

GC logs snapshot (screenshot omitted)

  • Did you check the GC log? – Steephen Jan 09 '19 at 19:33
  • I am not capturing GC logs as of now, but my JMX metrics indicate that there is indeed an increase in GC time. Usually it stays below 1%, but when this IO problem occurs, GC time increases to as much as 15-20%. I don't understand the relationship between GC and IO. Can you please help me understand how these two are related? – Ankur rana Jan 10 '19 at 02:11
  • Check whether GC is causing stop-the-world pauses: https://stackoverflow.com/questions/16695874/why-does-the-jvm-full-gc-need-to-stop-the-world – Steephen Jan 10 '19 at 02:16
  • I will start capturing GC logs right away and will let you know. Or is there a way to know without these logs? – Ankur rana Jan 10 '19 at 02:19
  • You have to check the GC logs, to the best of my knowledge – Steephen Jan 10 '19 at 02:20
  • Cool, I will start capturing. By the way, as far as I have read, a stop-the-world pause stops only the Java application it occurs in. How is it affecting other applications on the system? I mean, how come hdparm is also affected if GC in my Java app is causing the problem? – Ankur rana Jan 10 '19 at 02:43
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186438/discussion-between-steephen-and-ankur-rana). – Steephen Jan 10 '19 at 02:47

1 Answer


Answering my own question after figuring out the problem.

It turns out that the real problem was associated with the way the st1 HDD volume type works, rather than with Kafka or GC.

The st1 HDD volume type is optimized for workloads involving large, sequential I/O and performs very poorly with small random I/O. You can read more about it here. It should have worked fine for Kafka alone, but we were writing Kafka application logs to the same HDD, which added a lot of READ/WRITE I/O and depleted our burst credits very quickly during peak time. Our cluster worked fine as long as burst credits were available, and performance dropped once the credits were depleted.
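You can confirm this by watching the volume's BurstBalance CloudWatch metric, which drops toward 0% as burst credits are consumed. A rough boto3 sketch; the volume ID and region are placeholders:

```python
# Sketch only: pull the last few hours of BurstBalance (percent of burst
# credits remaining) for an EBS volume. Assumes boto3 credentials are set up.
from datetime import datetime, timedelta
import boto3

VOLUME_ID = "vol-0123456789abcdef0"  # placeholder volume id

cw = boto3.client("cloudwatch", region_name="us-east-1")
resp = cw.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}%")
```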

There are several solutions to this problem :

  1. First, remove any external applications adding I/O load to the st1 drive; it is not meant for that kind of small random I/O.
  2. Increase the number of st1 drives and divide the load across them in parallel. This is easy to do with Kafka, as it lets you keep data in different directories on different drives (the broker's log.dirs setting). But only new topics will be divided, because partitions are assigned to directories when the topic is created.
  3. Use gp2 SSD drives, as they handle both kinds of load well. But they are expensive.
  4. Use larger st1 drives sized for your use case, as both the throughput and the burst credits depend on the size of the disk (see the sketch below). READ HERE
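To put the size/throughput relationship from option 4 in numbers, here is a small sketch. The per-TiB figures are my reading of the AWS docs at the time (roughly 40 MiB/s per TiB baseline and 250 MiB/s per TiB burst, both capped at 500 MiB/s), so treat them as assumptions and check the current documentation:

```python
# Back-of-the-envelope estimate of st1 throughput as a function of volume size.
def st1_throughput(size_tib):
    baseline = min(40 * size_tib, 500)   # MiB/s the volume can sustain indefinitely
    burst = min(250 * size_tib, 500)     # MiB/s available while burst credits last
    return baseline, burst

for size in (1, 2, 4, 12.5):
    base, burst = st1_throughput(size)
    print(f"{size:5.1f} TiB st1 -> baseline {base:6.1f} MiB/s, burst {burst:6.1f} MiB/s")
```

For our 1 TB volume, that means only around 40 MiB/s sustained once credits run out, which is why the extra log I/O hurt so much.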

This article helped me a lot in figuring out the problem.

Thanks.
