
We have 6 Kafka brokers with 256 GB RAM and 24 cores/48 threads each; each broker hosts 20 x 1.8 TB SAS 10K RPM disks configured in RAID 10.

There are two Spark Streaming apps that

  • start their batches every 10 minutes;
  • once they start, their first job reads from the same Kafka topic;
  • read from a topic that has 200 partitions, spread evenly over the 6 brokers (roughly 33 partitions per broker);
  • consume from Kafka with client 0.8.2.1 (a minimal sketch of such a job follows this list).
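For reference, a minimal sketch of what such a batch job looks like, written in PySpark against the Kafka 0.8 direct-stream integration. The topic name, broker list, and the fact that it is Python at all are assumptions for illustration; the real apps may be written in Scala and do more work per batch.

```python
# Minimal sketch of a 10-minute-batch Spark Streaming job reading the topic.
# Assumptions: topic name "events", broker address, and PySpark are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka-0-8 integration

sc = SparkContext(appName="streaming-app-1")
ssc = StreamingContext(sc, batchDuration=600)  # batches start every 10 minutes

# Direct stream: one RDD partition per Kafka partition, so the first job of
# every batch reads from all 200 partitions at once.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],                                    # hypothetical topic name
    kafkaParams={"metadata.broker.list": "broker1:9092"}  # hypothetical broker list
)

stream.count().pprint()  # stand-in for the real per-batch processing
ssc.start()
ssc.awaitTermination()
```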

There are 21 injector instances that continuously write to that topic at a rate of 6K events/sec. They use the librdkafka producer to produce events to Kafka.

When the streaming apps wake up, their first job is reading from the topic. Once they do so, the %util on the Kafka disks jumps to 90-100% for 30-60 seconds, and at the same time all of the injector instances get "Queue full" errors from their Kafka producer. This is the producer configuration (a sketch of the producer follows the list):

  • queue.buffering.max.kbytes: 2097151
  • linger.ms: 0.5
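A minimal sketch of an injector's producer using the confluent-kafka Python binding around librdkafka, with the configuration above and a simple back-off when the local queue fills up. The broker list, topic name, and the retry loop are illustrative assumptions, not the injectors' actual code.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",    # hypothetical broker list
    "queue.buffering.max.kbytes": 2097151,  # ~2 GB local queue, as configured
    "linger.ms": 0.5,                       # flush batches after 0.5 ms
})

def send(event: bytes) -> None:
    """Produce one event, backing off when librdkafka reports 'Queue full'."""
    while True:
        try:
            producer.produce("events", value=event)  # hypothetical topic name
            producer.poll(0)                         # serve delivery callbacks
            return
        except BufferError:
            # The local queue is full because the brokers are not draining it
            # fast enough (e.g. during the disk-%util spike described above).
            producer.poll(0.1)                       # wait for deliveries to free space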

[graph: broker disk writes and %util over time during the read spike]

It can't be seen in this graph, but during the periods of high %util there are stretches in which the writes drop to 0. We assume that during these stretches the injectors' producer queues fill up, which is what triggers the "Queue full" errors.

It's worth mentioning that we use the deadline IO scheduler on the Kafka machines, which gives priority to read operations.

We have several thoughts about how to relieve the pressure on the writes:

  • to reduce unnecessary IOPS - change the Kafka disk configuration from RAID 10 to non-RAID ("JBOD");
  • to spread the reads - make the Spark apps read from Kafka at different times instead of waking up at the same time;
  • to change the relative priority of writes and reads - switch the IO scheduler to CFQ (a sketch of the switch follows this list).
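The scheduler switch is usually a one-line shell echo into sysfs; below is a small Python sketch of the same thing. The device names are assumptions, the change needs root, and it does not persist across reboots.

```python
# Switch the Kafka data disks' IO scheduler from deadline to CFQ.
# Equivalent to: echo cfq > /sys/block/sdX/queue/scheduler
DISKS = ["sdb", "sdc", "sdd"]  # hypothetical Kafka data disks

for disk in DISKS:
    path = f"/sys/block/{disk}/queue/scheduler"
    with open(path) as f:
        print(f"{disk}: current scheduler line: {f.read().strip()}")
    with open(path, "w") as f:
        f.write("cfq")  # takes effect immediately for new IO on this device
```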

I'm writing this question to verify that we're on the right track, and that the OS indeed stalls writes during reads because of the RAID 10 layout, the deadline IO scheduler, and too many reads happening at the same time.

What do you think?

Elad Eldor
  • Kafka doesn't do any intelligent caching - it's all kernel page cache. How much of your memory goes to the Java heap vs the kernel page cache? Would you be willing to use read quotas to throttle the consumers down? How well spread out is the load across all the brokers? – radai Feb 27 '20 at 14:50
  • You're right, the consumers should read only from the OS page cache ("PC"). However, there are many misses during reads because the current segments of all 200 partitions aren't fully in the PC when the consumers wake up (only about 80% of the current segment is there). Would it be too much OS intervention to use vmtouch to load all of the current segments into the PC? – Elad Eldor Mar 20 '20 at 18:59
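For what it's worth, the vmtouch idea from the last comment could look roughly like the sketch below: touch the newest log segment of each partition before the Spark batches start so its pages are already in the page cache. The data directory and topic name are assumptions.

```python
# Rough sketch of preloading the current segments into the page cache with vmtouch.
import glob
import subprocess

# Assumes a /var/lib/kafka data dir and a topic called "events".
for partition_dir in glob.glob("/var/lib/kafka/events-*"):
    segments = sorted(glob.glob(f"{partition_dir}/*.log"))
    if segments:
        # "vmtouch -t" reads the file and faults its pages into memory.
        subprocess.run(["vmtouch", "-t", segments[-1]], check=False)
```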

1 Answer


As you ask whether this is moving in the right direction:

I think the steps you mention make sense.

In general, I would always recommend not letting anything pull 100% of the disk capacity if it has to share those disks with something else that assumes some IO will be available.

Dennis Jaheruddin