I am using Spark Streaming to periodically process files from HDFS and write the results back to HDFS. Each worker generates a small file in every micro-batch, and I want to avoid producing so many small files (the output format is sequence file). Here are some potential solutions:
1- Each worker buffers its own output and writes it to HDFS only when the buffer reaches a predefined threshold.
2- Repartitioning in each micro-batch to merge the outputs of multiple workers and write them as a single file (a rough sketch is shown after this list).
3- Using another streaming job to merge the small files into bigger ones.
4- Writing key-value pairs into Hive and exporting big files from it.
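For example, option 2 could look roughly like the sketch below. It assumes the results form a DStream of string key-value pairs; the names `results` and `outputDir` are placeholders, not my actual code:

```scala
import org.apache.spark.streaming.dstream.DStream

// Rough sketch of option 2: write one sequence file per micro-batch.
// `results` and `outputDir` are placeholder names for illustration.
def writeMerged(results: DStream[(String, String)], outputDir: String): Unit = {
  results.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      rdd
        .repartition(1) // collapse all worker outputs into a single partition
                        // (coalesce(1) would avoid the full shuffle)
        .saveAsSequenceFile(s"$outputDir/batch-${time.milliseconds}")
    }
  }
}
```

This does give one file per batch, but it funnels every record through a single partition, which is exactly the traffic issue listed as drawback 2 below.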
But each one has its own drawbacks:
1- Buffering increases disk accesses. Moreover, in case of failure a large amount of input must be processed again.
2- Repartitioning increases network traffic. Moreover, the resulting file may still be small.
3- Merging doubles the number of reads and writes to HDFS.
4- According to Persisting Spark Streaming output, its performance is not desirable.
My question: Are there any other solutions to this problem? What is the best practice for such problems?
Thanks