I am confused about splittable and non-splittable file formats in the big data world.
I was using the zip file format, and I understood that zip files are non-splittable in the sense that to process such a file I had to use ZipFileInputFormat,
which basically unzips it and then processes it.
Then I moved to the gzip format, and I am able to process it in my Spark job,
but I have always wondered why people say the gzip file format is also not splittable.
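
As far as I understand it, "not splittable" means a reader cannot begin decoding at an arbitrary byte offset inside the file. Here is a minimal sketch of that idea using only Python's standard library. The sample data and the helper name `can_decode_from` are made up for illustration, and this is not how Spark actually reads files:

```python
import gzip
import zlib


def can_decode_from(data: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Decoding failed: the bytes at this offset are not a valid gzip stream.
        return False


# Hypothetical sample data, only to have something to compress.
payload = b"".join(b"line %d\n" % i for i in range(100_000))
compressed = gzip.compress(payload)

# Starting at the beginning of the stream works.
print(can_decode_from(compressed, 0))
# Starting in the middle is expected to fail, since the gzip header
# and the decoder state built up so far are both missing.
print(can_decode_from(compressed, len(compressed) // 2))
```

If this picture is right, a second task cannot simply pick up a gzip file halfway through, which is what I assume people mean by "not splittable".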
How is this going to affect my Spark job's performance?
For example, if I have 5k gzip files of different sizes, some as small as 1 KB and some as large as 10 GB, what will happen when I load them in Spark?
Should I use gzip in my case, or some other compression? If so, why?
Also, what is the difference in performance between these two cases?
CASE 1: I have a very large (10 GB) gzip file, load it in Spark, and run a count on it.
CASE 2: I have a splittable (bzip2) file of the same size, load it in Spark, and run a count on it.
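
To make the two cases concrete, here is a rough back-of-the-envelope sketch of how many read tasks I would expect for each. It assumes a 128 MB split size (the default of `spark.sql.files.maxPartitionBytes`) and that a non-splittable file is always read by a single task. The function name `estimated_read_tasks` is hypothetical, and the real numbers will depend on the Spark configuration and the API used to load the files:

```python
import math

# Assumed split size of 128 MB; an assumption for this estimate, not a measured value.
SPLIT_BYTES = 128 * 1024 * 1024


def estimated_read_tasks(file_bytes: int, splittable: bool) -> int:
    """Rough estimate of the number of tasks that read one file."""
    if not splittable:
        # A non-splittable file has to be decoded from start to end by one task.
        return 1
    # A splittable file can be cut into roughly split-sized chunks.
    return max(1, math.ceil(file_bytes / SPLIT_BYTES))


ten_gb = 10 * 1024**3

print("CASE 1 (gzip): ", estimated_read_tasks(ten_gb, splittable=False))
print("CASE 2 (bzip2):", estimated_read_tasks(ten_gb, splittable=True))
```

Under these assumptions, CASE 1 would be read by a single task while CASE 2 could be spread over about 80 tasks. Is that the right way to think about it?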