Using Spark Streaming to read and process messages from Kafka and write them to HDFS/Hive. Since I wish to avoid creating many small files, which spams the filesystem, I would like to know if there is a way to ensure a minimum file size and/or force a minimum number of output rows per file, except when a timeout is reached. Thanks.
-
Have you considered using [Kafka Connect HDFS](https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html) connector to write to HDFS from the Kafka topic? Along with Kafka Streams or KSQL to do any required processing. – Robin Moffatt May 08 '18 at 08:12
-
Can Kafka KSQL/Streams perform transformations on JSON messages? I need the ability to process the JSON and perform some transformations before writing it to HDFS/Hive. Thanks. – DigitalFailure May 08 '18 at 08:32
-
Yes, they can. Here's a simple example of it in action: https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka – Robin Moffatt May 08 '18 at 09:14
-
Kafka messages are just bytes, not files. You can write any consumer to parse the JSON string, manipulate, write to a separate "enriched" or "filtered" topic, then sink that to HDFS – OneCricketeer May 09 '18 at 03:22
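To make the comment thread concrete, below is a minimal sketch of the Kafka Streams route in Scala. The topic names, broker address, and the `transformJson` helper are placeholders I'm assuming for illustration, not anything from the question: the raw JSON is read from one topic, transformed, and published to an "enriched" topic that Kafka Connect HDFS (or Spark) could then sink to HDFS.

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.ValueMapper

object JsonEnricher extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "json-enricher")     // placeholder app id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")    // placeholder broker
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  // Hypothetical transformation: parse/reshape the JSON string as needed (identity here)
  def transformJson(json: String): String = json

  val builder = new StreamsBuilder()
  builder.stream[String, String]("raw-json")                          // placeholder source topic
    .mapValues(new ValueMapper[String, String] {
      override def apply(json: String): String = transformJson(json)
    })
    .to("enriched-json")                                              // placeholder sink topic

  new KafkaStreams(builder.build(), props).start()
}
```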
1 Answer
As far as I know, there is no way to control the number of lines in your output files. But you can control the number of output files.
Controlling that and considering your dataset size may help you with your needs, since you can calculate the size of each file in your output. You can do that with the `coalesce` and `repartition` commands:
df.coalesce(2).write(...)
df.repartition(2).write(...)
Both of them set the number of partitions to the value given as a parameter, so if you set 2, you should have 2 files in your output.
The difference is that with `repartition` you can both increase and decrease the number of partitions, while with `coalesce` you can only decrease it. Also, keep in mind that `repartition` performs a full shuffle to distribute the data equally among the partitions, which may be expensive in time and resources. `coalesce`, on the other hand, does not perform a full shuffle; it combines existing partitions instead.
You can find an awesome explanation in this other answer here
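If you are on Structured Streaming (an assumption on my part; the broker, topic, and HDFS paths below are placeholders), the same idea applies per micro-batch: coalesce before writing so each trigger produces few, larger files, and lengthen the trigger interval so more rows accumulate per file. A rough sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

// Kafka source: each message value arrives as bytes; cast to string before transforming
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
  .option("subscribe", "events")                      // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

// Fewer partitions per micro-batch => fewer, larger output files;
// a longer trigger interval => more rows accumulated per batch/file.
val query = parsed
  .coalesce(2)
  .writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")              // placeholder output path
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .trigger(Trigger.ProcessingTime("10 minutes"))
  .start()

query.awaitTermination()
```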

-
Thanks for that. Suppose I have a Kafka topic with 100 partitions; would it be an acceptable practice to have only (say) 10 partitions in your RDD so you output only 10 files? Also, since we are talking about streaming, files will need to close eventually; when does that happen? For every RDD? – DigitalFailure May 08 '18 at 08:37
-
Regarding the practice, it depends on your dataset size. Each partition must have an appropriate size; that is, it should be neither 1 KB nor 1500 TB. – SCouto May 08 '18 at 10:12
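On the follow-up about when files close: with the DStream API, each micro-batch's files are written and closed when that batch's output action completes, so the batch interval and the coalesce factor together bound file count and size. A rough sketch, assuming the kafka-0-10 direct stream (broker, topic, group id, and paths are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-to-hdfs-dstream")
val ssc = new StreamingContext(conf, Seconds(300)) // 5-minute batches => fewer, larger files

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",             // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "hdfs-writer"                       // placeholder group id
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

// Even if the topic has 100 partitions, coalesce(10) yields ~10 files per batch;
// a batch's files are closed once that batch's write finishes.
stream.map(_.value()).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.coalesce(10).saveAsTextFile(s"hdfs:///data/events/batch-${System.currentTimeMillis()}")
  }
}

ssc.start()
ssc.awaitTermination()
```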