This question says that the lz4 compression format is splittable and suitable for use in HDFS. OK, I have compressed 1.5 GB of data into a 300 MB lz4 file. If I read this file with Spark, what is the maximum number of workers it can use to read the file in parallel? Does the number of splittable pieces depend on the lz4 compression level?
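Here is a minimal sketch of what I mean (the HDFS path is hypothetical): I load the file and check how many partitions Spark creates, since that caps how many tasks can read it in parallel.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lz4-split-check").getOrCreate()

# Hypothetical path; replace with the real location of the compressed file.
rdd = spark.sparkContext.textFile("hdfs:///data/input.csv.lz4")

# The partition count is an upper bound on how many tasks read the file in parallel.
print(rdd.getNumPartitions())
```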
2 Answers
Compression will not affect the number of splittable pieces.
If the input file is compressed, fewer bytes are read from HDFS, which means less time spent reading data. That time saving benefits job execution performance.
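As a hedged illustration of that point (assuming an existing DataFrame `df` and a hypothetical output path), writing data with lz4 compression means downstream jobs pull fewer bytes from HDFS:

```python
# Assumes an existing DataFrame `df`; the output path is hypothetical.
# "lz4" is one of the compression codecs accepted by the CSV DataFrameWriter.
df.write.option("compression", "lz4").csv("hdfs:///data/output_lz4")
```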

But what about parallel reading of a single file? Do I just compress with lz4 and everything works, or not? Also, should the input file itself be split? For example, is it necessary to split input.csv into input.csv.part01 and input.csv.part02 so that input.csv.lz4 can be read in 2 threads? – Cherry Mar 15 '18 at 08:19
Yes, it should be input.csv.part01 and input.csv.part02, because the output of the map stage creates those parts. When you read the data again, it will be picked up as those separate parts. – sai pradeep kumar kotha Mar 15 '18 at 08:59
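A minimal sketch of that idea, assuming a hypothetical directory of lz4-compressed part files: each part file yields at least one partition, so several parts can be read in parallel even if the codec itself cannot be split.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical directory containing input.csv.part01.lz4, input.csv.part02.lz4, ...
parts = spark.sparkContext.textFile("hdfs:///data/input_parts/*.lz4")

# Each part file becomes at least one partition, so two part files
# can be read by two tasks regardless of codec splittability.
print(parts.getNumPartitions())
```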
Whether a compression codec is splittable definitely matters in Hadoop processing, so I disagree with the previous answer. Splittable essentially means that a mapper can read a logical split and process its data without worrying about the other parts of the file, compressed with the same algorithm and stored elsewhere in the datanode cluster.
For example, think about a Windows zip file. If I had a 10 GB file and planned to zip it with a maximum split size of 100 MB, I might create 10 files of 100 MB each (compressed to 1 GB in total). Can you write a program that processes part of the file without unzipping the whole thing back to its original state? That is the difference between a splittable and an unsplittable compression codec in the Hadoop context. For example, .gz is not splittable, whereas bzip2 is. Even if you have a .gz file in Hadoop, the whole file must first be decompressed by a single mapper and processed as one file. This is not efficient and does not use the power of Hadoop's parallelism.
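A hedged sketch of this difference, with hypothetical file paths: the same logical data stored as .gz versus .bz2 produces very different partition counts in Spark, because only bzip2 can be split.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths to the same data compressed with different codecs.
gz = spark.sparkContext.textFile("hdfs:///data/big.csv.gz")    # gzip: not splittable
bz2 = spark.sparkContext.textFile("hdfs:///data/big.csv.bz2")  # bzip2: splittable

# The gzip file typically ends up as a single partition (one task reads it all),
# while the bzip2 file can be split and read by several tasks in parallel.
print(gz.getNumPartitions(), bz2.getNumPartitions())
```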
A lot of people confuse splitting a compressed file into multiple parts on Windows or Linux with splitting a file in Hadoop using compression codecs.
Let's come back to why splittable compression matters. Hadoop essentially relies on mappers and reducers, and each mapper works on a logical split of the file (not the physical block). If the file is stored in a non-splittable format, a single mapper has to decompress the whole file before it can perform any operation on a record.
So be aware that the number of input splits is directly correlated with parallel processing in Hadoop.
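A minimal sketch of that correlation, assuming the codec is splittable and a hypothetical file path: the requested partition count only has an effect when the input can actually be split.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical path; ask Spark for at least 8 partitions when reading.
rdd = sc.textFile("hdfs:///data/input.csv.lz4", minPartitions=8)

# With a splittable codec this yields multiple splits (and parallel tasks);
# with a non-splittable one, the whole file still lands in a single partition.
print(rdd.getNumPartitions())
```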
