
When using Kafka as an event store (v0.10.0.0), how is it possible to configure the logs so that data is never lost?

I have seen the (old?) log.retention.hours setting, and I have been considering playing with compaction keys, but is there simply an option for Kafka to never delete messages?

Or is the best option to set a ridiculously high value for the retention period?

nha

3 Answers


You don't have a better option than using a ridiculously high value for the retention period.
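For illustration: the topic-level retention.ms config accepts -1, meaning no time limit, and the same goes for retention.bytes. A minimal sketch of creating such a topic (the topic name is hypothetical; the Java AdminClient only arrived in 0.11, so on the 0.10.0.0 from the question you would set the same configs with the kafka-topics.sh tool instead):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class CreateEternalTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        Map<String, String> configs = new HashMap<>();
        configs.put("retention.ms", "-1");     // -1 disables time-based deletion
        configs.put("retention.bytes", "-1");  // -1 disables size-based deletion

        try (AdminClient admin = AdminClient.create(props)) {
            // "events" is a hypothetical topic: 3 partitions, replication factor 2
            NewTopic topic = new NewTopic("events", 3, (short) 2).configs(configs);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```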

Fair warning: using infinite retention will probably hurt you a bit.

For example, the default behaviour only allows a new subscriber to start from the start or the end of a topic, which will be at least annoying from an event sourcing perspective.
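That behaviour comes from the consumer's auto.offset.reset setting; a small sketch (the group id is hypothetical):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class NewSubscriberConfig {
    // Without a previously committed offset, auto.offset.reset decides where
    // a new consumer group begins; the only built-in choices are the two
    // ends of the log.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "event-sourcing-app"); // hypothetical group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");  // "earliest" = start, "latest" = end
        return props;
    }
}
```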

Also, Kafka, if used at scale (say, tens of thousands of messages per second), benefits greatly from high-performance storage, the cost of which becomes ridiculously high with an eternal retention policy.

FYI, Kafka provides tools (e.g. Kafka Connect) to easily persist data to cheaper data stores.

C4stor
  • I was not aware of the "only from start or end" behaviour, and it could definitely be a problem. Are there workarounds to do something like "read the last 100 messages"? – nha Jun 24 '16 at 15:00
  • Not that I know of; your best bet would be to filter out messages based on some criterion (I guess a time-based field in your data?). – C4stor Jun 25 '16 at 19:22

Update: It’s Okay To Store Data In Apache Kafka

Obviously this is possible: if you just set the retention to "forever" or enable log compaction on a topic, then data will be kept for all time. But I think the question people are really asking is less whether this will work, and more whether it is something that is totally insane to do.

The short answer is that it's not insane: people do this all the time, and Kafka was actually designed for this type of usage. But first, why might you want to do this? There are actually a number of use cases; here are a few:

nha
  • While this is definitely an informed article, I feel that it addresses none of my concerns regarding both disk cost and actual data replaying. Any information on that? – C4stor Oct 31 '17 at 08:43

For people concerned with data replaying and the disk cost of eternal messages, I just wanted to share a few things.

Data replaying: you can seek your consumer to a given offset, and it is even possible to query the offset for a given timestamp. So if your consumer doesn't need to know all the data from the beginning, and a subset of the data is enough, you can use this.

I use the Kafka Java libraries, e.g. kafka-clients. See: https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes(java.util.Map)

and https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seek(org.apache.kafka.common.TopicPartition,%20long)
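To make this concrete, here is a sketch combining those two calls to replay only the last 24 hours instead of the whole log (the topic and group names are hypothetical; poll(long) matches the 0.10.1 API linked above):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replayer"); // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0); // hypothetical topic
            consumer.assign(Collections.singletonList(tp));

            // Replay only the last 24 hours instead of the full log.
            long since = System.currentTimeMillis() - 24L * 60 * 60 * 1000;

            // offsetsForTimes returns, per partition, the earliest offset whose
            // timestamp is >= the requested time (or null if none exists).
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, since));
            OffsetAndTimestamp start = offsets.get(tp);
            if (start != null) {
                consumer.seek(tp, start.offset());
            }

            ConsumerRecords<String, String> records = consumer.poll(1000);
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```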

Disk cost:

You can at least greatly reduce disk space usage by using something like Avro (https://avro.apache.org/docs/current/) with compaction turned on.
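If "compaction" here means Kafka's log compaction (retain only the latest record per key), here is a hedged sketch of enabling it on an existing topic; note that the incrementalAlterConfigs API used below needs kafka-clients 2.3+, and the topic name is hypothetical. (If message compression was meant instead, that's the compression.type config.)

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Properties;

public class EnableCompaction {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "events" is a hypothetical topic name
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            AlterConfigOp enableCompact = new AlterConfigOp(
                    new ConfigEntry("cleanup.policy", "compact"), // keep only the latest record per key
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singleton(enableCompact)))
                 .all().get();
        }
    }
}
```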

Maybe there is a way to use symbolic links to split the data across file systems, but that is only an untried idea.

Tonsic