I am using Spark Streaming to periodically process files from HDFS and write the results back to HDFS. Each worker generates a small file in every micro-batch, and I want to avoid producing so many small files (the output format is sequence file). Here are some potential solutions:
1- Each worker buffers its own output and writes it to HDFS only when the buffer reaches a predefined threshold.
2- Repartitioning in each micro-batch to merge the outputs of multiple workers and write them as a single file (a rough sketch is shown after this list).
3- Using another streaming job to merge the small files into bigger ones.
4- Writing key-value pairs into Hive and exporting big files from it.
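For example, option 2 could look roughly like the sketch below. It assumes the results form a DStream of string key-value pairs; the names `results` and `outputDir` are placeholders, not my actual code:

```scala
import org.apache.spark.streaming.dstream.DStream

// Rough sketch of option 2: write one sequence file per micro-batch.
// `results` and `outputDir` are placeholder names for illustration.
def writeMerged(results: DStream[(String, String)], outputDir: String): Unit = {
  results.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      rdd
        .repartition(1) // collapse all worker outputs into a single partition
                        // (coalesce(1) would avoid the full shuffle)
        .saveAsSequenceFile(s"$outputDir/batch-${time.milliseconds}")
    }
  }
}
```

This does give one file per batch, but it funnels every record through a single partition, which is exactly the traffic issue listed as drawback 2 below.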
But each one has its own drawbacks:
1- Buffering increases disk accesses. Moreover, in case of failure a large amount of input must be processed again.
2- Repartitioning increases network traffic. Moreover, the resulting file may still be small.
3- Merging doubles the number of reads and writes to HDFS.
4- According to Persisting Spark Streaming output, its performance is not desirable.
My question: Are there any other solutions to this problem? What is the best practice for such problems?
Thanks