I am using apache kafka to produce and consume a file 5GB in size. I want to know if there is a way where the message from the topic is automatically removed after it is consumed. Do I have any way to keep track of consumed messages? I don't want to delete it manually.
A traditional message broker would suit your case better, if you have the choice. – Yassin Hajaj Dec 26 '20 at 14:44
5 Answers
In Kafka, keeping track of what has been consumed is the consumer's responsibility, and this is one of the main reasons Kafka has such great horizontal scalability.
Using the high-level consumer API will do this for you automatically by committing consumed offsets to ZooKeeper (or, with a more recent configuration option, to a special Kafka topic that keeps track of consumed offsets).
The simple consumer API makes you deal with how and where to keep track of consumed messages yourself.
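The same split exists in the modern clients: offset tracking is configured on the consumer. As an illustration, here is a sketch using `kafka-python`'s parameter names (the broker address and group name are assumptions; building the dict alone does not contact a broker):

```python
# Consumer settings that control offset tracking (kafka-python naming).
consumer_config = {
    "bootstrap_servers": "localhost:9092",  # assumed broker address
    "group_id": "file-processor",           # offsets are stored per consumer group
    "enable_auto_commit": True,             # commit consumed offsets automatically
    "auto_commit_interval_ms": 5000,        # how often offsets are committed
    "auto_offset_reset": "earliest",        # where to start when no offset is stored
}

# With a broker running you would pass this to the consumer:
# from kafka import KafkaConsumer
# consumer = KafkaConsumer("my-topic", **consumer_config)
```

With `enable_auto_commit` set to `False` you are back in "simple consumer" territory and must call commit yourself.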
Purging of messages in Kafka is done automatically, either by specifying a retention time for a topic or by defining a disk quota for it. So for your case of one 5 GB file, the data will be deleted after the retention period you define has passed, regardless of whether it has been consumed or not.
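For example, retention can be bounded by time, by size, or both. A sketch of the broker-side settings in `server.properties` (the values here are illustrative, not recommendations):

```properties
# Delete log segments older than 24 hours...
log.retention.hours=24
# ...or once a partition's log exceeds ~1 GiB, whichever limit is hit first.
log.retention.bytes=1073741824
# How often the broker checks for segments eligible for deletion.
log.retention.check.interval.ms=300000
```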

For cases where you want to consume a message, modify it and push it back to a different topic, it makes sense to delete messages as soon as they are consumed. Otherwise you end up with 2 copies of everything for your retention period. If you can have a retention period on the first topic but delete consumed messages, this is ideal. – MikeKulls Jul 25 '18 at 01:39
Are you sure that data will be deleted from topics after the retention policy expires even though the message has not been consumed? That means that while consuming data from a given partition, the consumer could see "holes" or missing messages. Isn't that breaking Kafka's promise of being a reliable message-passing medium? – ankit patel Aug 28 '20 at 19:09
You cannot delete a Kafka message on consumption; Kafka has no mechanism to directly delete a message when it is consumed.
The closest thing I have found is the following trick, but it is untested and, by design, it will not work on the most recent messages:
A potential trick is to use a combination of (a) a compacted topic, (b) a custom partitioner, and (c) a pair of interceptors.
The process would be:
- Use a producer interceptor to add a GUID to the end of the key before it is written.
- Use a custom partitioner to ignore the GUID for the purposes of partitioning.
- Use a compacted topic so you can then delete any individual message you need via producer.send(key+GUID, null)
- Use a consumer interceptor to remove the GUID on read.
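A minimal sketch of the key handling behind these steps (pure Python, illustrative only: the fixed-width GUID suffix and Python's built-in `hash` are stand-ins for whatever encoding and partitioner hash you would actually use):

```python
import uuid

GUID_LEN = 32  # length of uuid4().hex; an assumed fixed-width suffix


def add_guid(key: str) -> str:
    """Producer interceptor: append a GUID so every record's key is unique."""
    return key + uuid.uuid4().hex


def strip_guid(key: str) -> str:
    """Consumer interceptor / partitioner helper: drop the GUID suffix."""
    return key[:-GUID_LEN]


def partition_for(key: str, num_partitions: int) -> int:
    """Custom partitioner: hash only the original key, ignoring the GUID,
    so all records for one logical key land on the same partition."""
    return hash(strip_guid(key)) % num_partitions
```

To delete an individual record later, you would re-send its full key (original key plus its GUID) with a `null` value, so log compaction eventually removes it.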
But you should not need this capability:
Have 1 or more consumers, and want a message to be consumed only once in total by them?
Put them in the same consumer group.
Want to avoid too many messages filling up the disk?
Set up retention in terms of disk space and/or time.


To my knowledge, you can delete consumed data from the logs by reducing the retention time. The default retention is 168 hours (7 days), after which the data is automatically removed from the Kafka topic you created. My suggestion is to go to server.properties, which is located in the config folder, and change 168 to a smaller value. Then no data will remain after the amount of time you set for log.retention.hours, and your issue will be solved.
log.retention.hours=168
Keep coding

This isn't a solution to the OP's problem. It will delete any messages, whether they have been consumed or not. – Robin May 31 '18 at 14:06
You can use a consumer group: Kafka guarantees that a message is only ever read by a single consumer within the group. https://www.tutorialspoint.com/apache_kafka/apache_kafka_consumer_group_example.htm
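To see why a group gives you once-per-group consumption: Kafka assigns each partition to exactly one member of the group. A rough round-robin-style sketch of that idea in pure Python (the real assignment is negotiated by the clients and group coordinator; this only illustrates the invariant):

```python
def assign_partitions(consumers, num_partitions):
    """Spread partitions across group members so each partition has
    exactly one owner (simplified round-robin assignment)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        owner = consumers[p % len(consumers)]
        assignment[owner].append(p)
    return assignment
```

For example, `assign_partitions(["c1", "c2"], 6)` gives each consumer three partitions; since no partition has two owners, each message is delivered to the group only once.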

@RKRK Could you add a bit more information to your comments? It might be obvious to you, but to a new user, it is probably not obvious how you think they should improve their answers. – David Buck May 24 '20 at 08:06
I just ran into this issue and built a script that can be run periodically to 'mark' consumed records as deleted. Kafka will not free the space immediately, but it will delete log segments whose offsets fall outside the 'active' range.
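The idea behind such a script: read each group's committed offsets and feed them to Kafka's `kafka-delete-records.sh` tool, which truncates a partition up to a given offset. A sketch of building that tool's JSON input (the committed offsets below are made up; a real script would fetch them from the broker):

```python
import json


def delete_records_payload(topic, committed_offsets):
    """Build the JSON document expected by kafka-delete-records.sh:
    records below each given offset become eligible for deletion."""
    return {
        "partitions": [
            {"topic": topic, "partition": p, "offset": off}
            for p, off in sorted(committed_offsets.items())
        ],
        "version": 1,
    }


# Hypothetical committed offsets per partition:
payload = delete_records_payload("big-file-topic", {0: 1200, 1: 980})
print(json.dumps(payload))
# Then run:
# kafka-delete-records.sh --bootstrap-server localhost:9092 \
#     --offset-json-file offsets.json
```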
https://gist.github.com/ThePsyjo/b717d2eaca2deb09b8130b3e917758f6
