Background:
We have a project that uses Spark to process log/CSV files. Each file is very large, for example 20 GB, so we need to compress the log/CSV files.
Example
HDFS block size: 128 MB, and we have a 1 GB log file.
If the file is not compressed, it occupies 8 blocks in HDFS:
val rddFlat = sc.textFile("hdfs:///tmp/test.log")
rddFlat.partitions.length will be 8 (since there are 8 input splits).
If we use bzip2 and suppose the compressed size is 256 MB (bzip2 actually has a high compression ratio), there will be 2 blocks:
val rddCompress = sc.textFile("hdfs:///tmp/test.log.bz2")
rddCompress.partitions.length will be 2 (is that right?)
Suppose we then run the following transformation and action to count the lines that contain "error":
val cnFlat = rddFlat.filter(x => x.contains("error")).count()
val cnCompress = rddCompress.filter(x => x.contains("error")).count()
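The partition counts above can be checked empirically. Below is a minimal spark-shell sketch (using the assumed test files from the example) that compares the number of HDFS blocks a file occupies, as reported by the Hadoop FileSystem API, with the number of partitions Spark derives from the input splits:

import org.apache.hadoop.fs.Path

// Assumed test files from the example above.
for (file <- Seq("hdfs:///tmp/test.log", "hdfs:///tmp/test.log.bz2")) {
  val path   = new Path(file)
  val fs     = path.getFileSystem(sc.hadoopConfiguration)
  val status = fs.getFileStatus(path)
  // Number of HDFS blocks the file occupies.
  val blocks = fs.getFileBlockLocations(status, 0, status.getLen).length
  // Number of partitions Spark creates for the same file.
  val parts  = sc.textFile(file).partitions.length
  println(s"$file: $blocks HDFS blocks, $parts Spark partitions")
}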
My doubts
(The relation between HDFS blocks, input splits, and Spark partitions for a compressed file, and between splittable and non-splittable compression.)
How does Spark handle the compressed partitions?
Does each executor decompress its assigned partition into a Spark block and then run the transformations and the action on that block?
Which one is slower, if we exclude the decompression time?
Will the cnCompress calculation be slower because there are only 2 partitions, so only two nodes perform the transformation and action, while cnFlat has 8 partitions?
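If the reduced parallelism (rather than the decompression itself) turns out to be the problem, one common workaround is to repartition after reading, at the cost of a shuffle. A minimal sketch, assuming a target of 8 partitions to match the uncompressed case:

// Restore parallelism after reading a compressed file that yields only a few partitions.
// The target of 8 partitions is an assumption chosen to match the uncompressed example.
val cnCompressRepart = sc.textFile("hdfs:///tmp/test.log.bz2")
  .repartition(8)                         // shuffle the data across 8 partitions
  .filter(x => x.contains("error"))
  .count()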
When choosing a compression codec (splittable or not splittable), do we need to consider the compressed size?
If, after compression, the compressed size is less than or equal to the HDFS block size, does the choice between a splittable and a non-splittable codec become irrelevant, because there will be only one partition in the Spark RDD either way (i.e. only one worker will process the RDD)?
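Related to this doubt: whether a single large compressed file can produce more than one input split is determined by whether Hadoop considers its codec splittable. A small sketch (with hypothetical file names; the codec is resolved from the file extension, and bzip2 is splittable while gzip is not) to check this:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Hypothetical file names; the codec is looked up from the extension.
val factory = new CompressionCodecFactory(sc.hadoopConfiguration)
for (file <- Seq("hdfs:///tmp/test.log.bz2", "hdfs:///tmp/test.log.gz")) {
  val codec = factory.getCodec(new Path(file))
  val splittable = codec != null && codec.isInstanceOf[SplittableCompressionCodec]
  println(s"$file -> ${Option(codec).map(_.getClass.getSimpleName).getOrElse("none")}, splittable = $splittable")
}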