
I am confused about splittable and non-splittable file formats in the big data world. I was using the zip format, and I understood that zip files are non-splittable in the sense that when I processed one I had to use ZipFileInputFormat, which basically unzips the file and then processes it.

Then I moved to the gzip format. I am able to process it in my Spark job, but I have always wondered why people say the gzip file format is also not splittable.

How is that going to affect my Spark job's performance?

For example, if I have 5,000 gzip files of different sizes, some of them 1 KB and some of them 10 GB, and I load them in Spark, what will happen?

Should I use gzip in my case, or some other compression? If so, why?

Also, what is the difference in performance between the following cases?

CASE 1: I have a very large (10 GB) gzip file, load it in Spark, and run a count on it.

CASE 2: I have a splittable (bzip2) file of the same size, load it in Spark, and run a count on it.

  • Possible duplicate of [Spark: difference when read in .gz and .bz2](https://stackoverflow.com/questions/37445054/spark-difference-when-read-in-gz-and-bz2) – Xavier Guihot Feb 22 '18 at 21:13
  • Both Gzip and Zip are not splittable. LZO, Snappy, and Bzip2 are the only splittable compressed formats, meaning they can be processed in parallel, for this purpose – OneCricketeer Feb 23 '18 at 02:37
  • @cricket_007 So what is the significance of being processable in parallel in my example? How will the performance be impacted? –  Feb 23 '18 at 03:36
  • You do understand what it means to run things in parallel, right? – OneCricketeer Feb 23 '18 at 03:37
  • @cricket_007 Not in my case, sorry, I have little knowledge about that. –  Feb 23 '18 at 03:41
  • Okay, I trust you can find the wiki page on parallel or "distributed computing" – OneCricketeer Feb 23 '18 at 03:44

1 Answer


First, you need to remember that neither Gzip nor Zip is splittable. LZO and Bzip2 are the only splittable archive formats. Snappy is also splittable, but it's only a compression format.

For the purpose of this discussion, splittable files are files that can be processed in parallel across many machines rather than on only one.

Now, to answer your questions:

> CASE 1: I have a very large (10 GB) gzip file, load it in Spark, and run a count on it.

It's loaded by only one CPU core on one executor, since the file is not splittable.
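As a minimal sketch of what that looks like in a spark-shell (the path here is a hypothetical placeholder, not from the original post), you can inspect how many partitions Spark assigns to the file:

```scala
// Gzip is not splittable, so the whole 10 GB file becomes a single
// input split: one partition, processed by one task on one core,
// no matter how large the cluster is.
val gz = sc.textFile("hdfs:///data/huge.gz")
println(gz.getNumPartitions) // 1
println(gz.count())          // scanned by a single task
```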

> CASE 2: I have a splittable (bzip2) file of the same size, load it in Spark, and run a count on it.

Divide the file size by the HDFS block size, and that is roughly how many cores across all executors you can expect to be working on counting that file.
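For instance, assuming the default 128 MB HDFS block size (your cluster may differ), a 10 GB bzip2 file yields roughly 10 240 MB / 128 MB ≈ 80 splits. A sketch with the same hypothetical path convention as above:

```scala
// Bzip2 is splittable, so Hadoop's input format cuts it at HDFS block
// boundaries: ~80 splits for 10 GB at 128 MB blocks, meaning up to
// ~80 tasks can scan and count the file in parallel.
val bz = sc.textFile("hdfs:///data/huge.bz2")
println(bz.getNumPartitions) // ~80, versus 1 for the gzip file
println(bz.count())
```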

Regarding any file smaller than the HDFS block size, there is no difference, because either way it takes one task on one CPU to read that one tiny file, whether or not the format is splittable.
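That also covers the 5,000-file scenario from the question: even though each gzip file is unsplittable, Spark still parallelizes across files, with one partition per file. A hedged sketch, assuming a hypothetical glob path:

```scala
// Each gzip file is still read by exactly one task, but 5,000 files
// give ~5,000 tasks, so the cluster stays busy; only the single
// 10 GB gzip file becomes a one-core bottleneck.
val many = sc.textFile("hdfs:///data/logs/*.gz")
println(many.getNumPartitions) // one partition per gzip file
```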

OneCricketeer
  • I have upvoted because I found something useful for me. What is the meaning of loaded by one CPU? So which do you think I should use for better performance, gzip or bzip2? –  Feb 23 '18 at 03:39
  • You need to understand what the computer is actually doing with your files. A CPU is the physical unit doing the work. If your file is not splittable, only one machine can process it. You must wait for the entire file to be read by it, compared to many, many machines reading much smaller parts of the file in far less time.... Now, you tell me which is better – OneCricketeer Feb 23 '18 at 03:43
  • @cricket_007 If I may, I found that comment too valuable not to figure in your answer – eliasah Feb 23 '18 at 08:25
  • File processing and Spark configuration for executor memory, num-executors, etc. are explained very nicely here: https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html. This may not be relevant to the original question, but since some queries were raised about CPUs, memory, etc., it should help understand things better. – Yuva May 17 '18 at 06:47