I'm new to Apache Kafka and would like to know how big a message can be in Apache Kafka. Is it efficient to use Apache Kafka if the messages become quite big, let's say hundreds of MB?

I have a scenario in which I would like to copy files to HDFS to be used by a Hadoop job; these files are also used by other processes. I was thinking of copying the files into Apache Kafka first, so that one consumer can copy them to HDFS while other consumers read them from Kafka. Is this the best approach or not?

Thomas Jungblut
HHH
  • possible duplicate of [Kafka: Sending a 15MB message](http://stackoverflow.com/questions/21020347/kafka-sending-a-15mb-message) – Thomas Jungblut Apr 28 '15 at 20:11
  • My concern is mostly about the best architecture to achieve my goals, e.g. whether Kafka is a good fit for my scenario – HHH Apr 28 '15 at 20:18

2 Answers

The max.message.bytes topic property defines the largest message size Kafka will allow to be appended to a topic. Note that if you increase this size, you must also increase your consumers' fetch size so that they can fetch messages this large.
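
As a rough illustration of which settings tend to move together, here is a small sketch that collects them in one place. The helper name `size_settings` is made up for this example. The property names are real Kafka settings, but which ones apply depends on your Kafka version and client (older consumers use fetch.message.max.bytes instead of max.partition.fetch.bytes), so check them against the documentation for your release.

```python
def size_settings(max_message_bytes: int) -> dict:
    """Group the size-related settings that usually have to agree.

    Assumption: every limit is set to the same value for simplicity;
    real deployments may want some headroom on the fetch sizes.
    """
    return {
        # Per-topic limit on the size of an appended message.
        "topic": {"max.message.bytes": max_message_bytes},
        # Broker-wide default limit, plus the fetch size replicas use
        # when copying messages between brokers.
        "broker": {
            "message.max.bytes": max_message_bytes,
            "replica.fetch.max.bytes": max_message_bytes,
        },
        # Largest request the producer is willing to send.
        "producer": {"max.request.size": max_message_bytes},
        # Per-partition fetch size on the consumer side.
        "consumer": {"max.partition.fetch.bytes": max_message_bytes},
    }


if __name__ == "__main__":
    # Example: allow messages of up to 15 MB.
    for component, props in size_settings(15 * 1024 * 1024).items():
        for name, value in props.items():
            print(f"{component}: {name}={value}")
```

The point is only that raising one limit without the others tends to leave large messages stuck somewhere along the path from producer to consumer.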

Also, please update the question with more details about your source so that we can evaluate whether Kafka is the best tool.

Karthik

Your architecture of having one consumer set simply write to HDFS, and another consumer set consume the same messages but for computation, for example, has been used in a real production deployment where I work to great effect.

As for your concern about the size of the message, if I assume that memory is unbounded, then there is no issue with your suggestion. Otherwise, if you have memory restrictions, then I would suggest that you break up each message into fixed-size chunks in the producer, because the message size in Kafka's brokers and consumers is a hard limit that you configure for all such messages in the topic, and so adjusting it is a royal pain. It should be easily possible to use fixed-size chunks with a key indicating the offset, and use the offset to reassemble each message at the consumer side.
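
A minimal sketch of the chunking idea, assuming nothing about the actual production code mentioned above: the function names, the `file_id:offset:total` key layout, and the chunk size are all hypothetical choices for illustration. It uses no Kafka client; the (key, value) pairs stand in for whatever the producer sends and the consumer receives.

```python
from collections import defaultdict

# Hypothetical chunk size; it should stay below the topic's message size limit.
CHUNK_SIZE = 512 * 1024


def split_into_chunks(payload: bytes, file_id: str, chunk_size: int = CHUNK_SIZE):
    """Yield (key, value) pairs; each key carries the file id, byte offset and total size."""
    total = len(payload)
    for offset in range(0, total, chunk_size):
        key = f"{file_id}:{offset}:{total}"
        yield key, payload[offset:offset + chunk_size]


def reassemble(records):
    """Rebuild payloads from (key, value) pairs that may arrive in any order.

    Only files whose chunks add up to the announced total size are returned.
    """
    parts = defaultdict(dict)
    totals = {}
    for key, value in records:
        # rsplit keeps any ':' inside the file id intact.
        file_id, offset, total = key.rsplit(":", 2)
        parts[file_id][int(offset)] = value
        totals[file_id] = int(total)

    complete = {}
    for file_id, chunks in parts.items():
        data = b"".join(chunks[o] for o in sorted(chunks))
        if len(data) == totals[file_id]:
            complete[file_id] = data
    return complete


if __name__ == "__main__":
    payload = bytes(range(256)) * 10
    records = list(split_into_chunks(payload, "file-1", chunk_size=1000))
    assert reassemble(reversed(records)) == {"file-1": payload}
```

Carrying the total size in the key lets the consumer tell when a file is complete. In a real deployment you would probably also want all chunks of one file to land in the same partition, which this sketch does not address.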

This exact scenario has also arisen where I work and was solved in the aforementioned way. Good luck.

laughing_man