
I am using Spark Streaming to periodically process files from HDFS and write the results back to HDFS. In each micro-batch, every worker generates a small file. I want to avoid producing such small files (the output format is sequence file). Here are some potential solutions:

1- Each worker buffers its own output. When its buffer reaches a predefined threshold, it writes the buffered data to HDFS.

2- Using repartition (or coalesce) in each micro-batch to merge the outputs of multiple workers, then writing them as a single file (see the sketch after this list).

3- Using another job to merge the small files into bigger ones.

4- Writing Key-Value pairs into Hive and exporting big files from it.
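As a minimal sketch of option 2 (not a definitive implementation), the merging could be done inside foreachRDD before each micro-batch is written. The paths, batch interval, and key-value mapping below are placeholders I made up for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CoalesceBeforeWrite {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("small-file-mitigation")
        val ssc  = new StreamingContext(conf, Seconds(60))

        // Hypothetical input directory; replace with the real one.
        val lines = ssc.textFileStream("hdfs:///input/dir")

        // Hypothetical key-value mapping; String pairs get implicit
        // Writable conversions for saveAsSequenceFile.
        val pairs = lines.map(line => (line.take(10), line))

        pairs.foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty()) {
            // coalesce(1) avoids the full shuffle that repartition(1) does,
            // but still funnels the whole micro-batch through a single task.
            rdd.coalesce(1)
               .saveAsSequenceFile(s"hdfs:///output/dir/batch-${time.milliseconds}")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that coalesce(1) still pushes the whole batch through one task, which is essentially the traffic/throughput concern raised in drawback 2 below.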

But each one has its own drawbacks:

1- Buffering increases disk accesses. Moreover, in the case of a failure, a large amount of input must be processed again.

2- Repartitioning increases network traffic. Moreover, the resulting file may still be small.

3- Merging doubles the number of reads and writes to HDFS (a compaction sketch follows after this list).

4- According to "Persisting Spark Streaming output", the performance of this approach is not desirable.
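If option 3 were pursued despite the extra I/O, a separate compaction job might look roughly like the following sketch; the staging and compacted paths and the target partition count of 4 are placeholders, and the Writables are copied into Strings because Hadoop reuses the key/value objects:

    import org.apache.hadoop.io.Text
    import org.apache.spark.{SparkConf, SparkContext}

    object CompactSequenceFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sequence-file-compaction"))

        // Read all the small sequence files produced by the streaming job.
        val small = sc.sequenceFile("hdfs:///output/staging/*", classOf[Text], classOf[Text])
                      .map { case (k, v) => (k.toString, v.toString) } // copy out of reused Writables

        // Rewrite as a few larger files; pick the partition count from the
        // total input size divided by the desired file size (e.g. one HDFS block).
        small.coalesce(4).saveAsSequenceFile("hdfs:///output/compacted")

        sc.stop()
      }
    }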

My question: Are there any other solutions to this problem? What is the best practice for such problems?

Thanks
