
I want to describe the following case that occurred on one of our production clusters.

We have an Ambari cluster with HDP version 2.6.4.

The cluster includes 3 Kafka machines, and each Kafka broker has a 5 TB disk.

What we saw is that all Kafka disks were at 100% usage, so the Kafka disks were full, and this is the reason all the Kafka brokers failed:

df -h /kafka
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         5T   5T   23M   100% /var/kafka

After investigation, we saw that log.retention.hours was set to 7 days (168 hours).

So it seems that purging happens only after 7 days, and maybe this is why the Kafka disks are 100% full even though they are huge (5 TB).

What we want to do now is figure out how to avoid this situation in the future.

So, we want to know how to avoid fully used capacity on the Kafka disks.

What do we need to set in the Kafka config in order to purge the Kafka disk according to the disk size? Is that possible?

And how do we know the right value for log.retention.hours? Should it be based on the disk size, or on something else?


2 Answers


In Kafka, there are two types of log retention: size-based and time-based. The former is triggered by log.retention.bytes, while the latter by log.retention.hours.

In your case, you should pay attention to size-based retention, which can sometimes be quite tricky to configure. Assuming that you want a delete cleanup policy, you'd need to set the following parameters:

log.cleaner.enable=true
log.cleanup.policy=delete

Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:

  • log.retention.bytes is a minimum guarantee for a single partition of a topic: if you set log.retention.bytes to 512MB, you will always have at least 512MB of data (per partition) on your disk.

  • Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512MB of data plus the data produced within the 5-minute window before the retention policy is triggered.

  • A topic log on disk is made up of segments. The segment size depends on the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will have up to 3 segments on disk at any time (2 segments that count towards the retention limit, and a 3rd, active segment that data is currently being written to).

Finally, you should do the math and compute the maximum size that might be reserved by the Kafka logs on your disk at any given time, and tune the aforementioned parameters accordingly. Of course, I would also advise setting a time retention policy and configuring log.retention.hours accordingly. If after 2 days you don't need your data anymore, then set log.retention.hours=48.
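
To make the math concrete, here is a rough back-of-the-envelope sketch (plain Python; every number in it is an illustrative assumption, so plug in your own partition count and measured produce rates):

# Rough worst-case estimate of the disk space used by Kafka logs on one broker.
# All values below are assumptions for illustration -- substitute your own.

GB = 1024 ** 3
MB = 1024 ** 2

log_retention_bytes = 1 * GB        # size retention per partition
log_segment_bytes = 512 * MB        # size of a single log segment
check_interval_s = 300              # log.retention.check.interval.ms = 5 minutes
peak_write_bytes_per_s = 10 * MB    # assumed peak produce rate per partition
partitions_per_broker = 200         # partitions hosted on this broker (incl. replicas)

# Per partition, the worst case is the retained data, plus the active segment,
# plus whatever arrives before the next retention check runs.
worst_case_per_partition = (
    log_retention_bytes
    + log_segment_bytes
    + peak_write_bytes_per_s * check_interval_s
)

worst_case_per_broker = worst_case_per_partition * partitions_per_broker

print(f"worst case per partition: {worst_case_per_partition / GB:.2f} GB")
print(f"worst case per broker:    {worst_case_per_broker / GB:.2f} GB")

If the broker-level figure comes anywhere near the size of your 5T disk, either lower log.retention.bytes or grow the disks.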

Giorgos Myrianthous
  • @Myrianthous I have a little question - let's say we have a partition called jtr.avo.control.kolp-0, and we set log.retention.bytes to 512M and log.retention.check.interval.ms to 5 minutes. Does that mean that if this topic partition grows to 600M, then after 5 min the data will be purged and the topic partition will return to 512M? Am I right here? – Judy Oct 24 '18 at 15:05
  • Given that the partition’s segment has closed (i.e. it has exceeded log.segment.bytes) then yes, that log segment will be deleted within 5 mins. – Giorgos Myrianthous Oct 24 '18 at 15:07
  • @Myrianthous OK, if my assumption is true, then what we actually get is data loss. Let's take my example: say the topic partition is exactly 512M right now; that means every additional piece of data written to this topic will be deleted!!! So how can we accept this - don't we actually have data loss here? – Judy Oct 24 '18 at 15:13
  • The deletion takes place from the tail and therefore older messages will be the first to be deleted. Note that Kafka is not a database. You are supposed to consume the messages before deleting them. – Giorgos Myrianthous Oct 24 '18 at 15:14
  • BTW, in the Ambari Kafka config we have log.cleanup.interval.mins, so I guess it is actually the log.retention.check.interval.ms – Judy Oct 24 '18 at 15:55
  • @Judy They are both intended to perform the same thing but be careful as `log.retention.check.interval.ms` takes milliseconds as input while `log.cleanup.interval.mins` takes minutes. – Giorgos Myrianthous Oct 24 '18 at 16:13
  • @Myrianthous so on our side we also need to know the number of topic partitions; only after we know the number of all partitions can we calculate the other values. For example, if we have 200 topic partitions and each topic partition is limited to 1G, then the total size of all topic partitions will be at most 200G. Am I right? – Judy Oct 24 '18 at 16:46
  • @Judy Yes but not exactly. It might be more than that as log.retention.bytes is the minimum guarantee. – Giorgos Myrianthous Oct 24 '18 at 17:55
  • @GiorgosMyrianthous So `log.retention.bytes` defines the maximum size of all segment files except for the active one? Am I understanding it correctly? – KevinZhou Jun 20 '20 at 04:31

I think you have three options:

1) Increase the size of the disks until you have a comfortable amount of free space under your current retention policy of 7 days. For me, a comfortable amount of free space is around 40% (but that is personal preference).

2) Lower your retention policy to, for example, 3 days and see if your disks are still full after a period of time. The right retention period varies between use cases. If you don't need the data on Kafka as a backup when something goes wrong, then just pick a very low retention period. If it is crucial that you keep those 7 days' worth of data, then you should change not the period but the disk sizes. One way to size the retention period is sketched after this list.

3) A combination of options 1 and 2.
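
For option 2, a sensible retention period can be worked out backwards from the disk size and the write rate. A minimal sketch (plain Python; the write rate and headroom target below are assumptions, so substitute your own measurements):

# Pick the largest log.retention.hours that keeps the disk below a target
# usage level. The write rate and headroom are assumed example values.

TB = 1024 ** 4
GB = 1024 ** 3

disk_bytes = 5 * TB              # size of the /kafka disk on one broker
target_usage = 0.60              # aim for at most 60% used (40% free)
write_bytes_per_hour = 30 * GB   # measured produce rate per broker, incl. replication

max_retention_hours = disk_bytes * target_usage / write_bytes_per_hour
print(f"log.retention.hours should be at most ~{max_retention_hours:.0f} hours")

With these example numbers, that comes out to roughly 100 hours (about 4 days). Re-check the figure whenever traffic grows, since the safe retention period shrinks as the write rate goes up.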

More information about optimal retention policies: Kafka optimal retention and deletion policy

Stanko
  • Is it possible to purge the topics' data in Kafka when, for example, more than 80% of the disk is used? (Maybe some parameter in the Kafka config will do this, but I am not sure whether this is a real option.) – Judy Oct 24 '18 at 13:52
  • I think you could set `log.retention.bytes` to 4 terabytes. That is the maximum size of the log before deleting it. – Stanko Oct 24 '18 at 13:55
  • retention.bytes - This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes. – Judy Oct 24 '18 at 13:57
  • 2
    `log.retention.bytes` only applies to individual log files, which are each only an individual partition belonging to a topic. So, unless each broker is only serving one partition from a single topic, this is not going to work. – mjuarez Oct 24 '18 at 13:58
  • as you can see, log.retention.bytes is per partition, not the size of the whole Kafka disk – Judy Oct 24 '18 at 13:59
  • @mjuarez - do you think we can set some rule that will purge the logs when the 80% disk-used threshold is reached? – Judy Oct 24 '18 at 14:00
  • @Judy Indeed, my bad. Then I would opt for lowering the retention policy. – Stanko Oct 24 '18 at 14:02