Spark: difference when read in .gz and .bz2

Question

I normally read and write files in Spark as .gz, where the number of files equals the number of RDD partitions, i.e. one giant .gz file is read into a single partition. However, if I read a single .bz2 file, would I still get one giant partition? Or does Spark automatically split one .bz2 file into multiple partitions?

Also, how do I know how many partitions there will be when Hadoop reads a single bz2 file? Thanks!

axiom · Accepted Answer · 2016-05-25T21:15:09.557

    However, if I read in one single .bz2, would I still get one single giant partition?
    Or will Spark support automatic split one .bz2 to multiple partitions?

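bzip2 is a splittable format, so if you ask for n partitions when reading a bzip2 file, Spark will spawn about n tasks to read the file in parallel (the value is a minimum, so there may be more). The default value of n is sc.defaultMinPartitions, which is min(sc.defaultParallelism, 2). The number of partitions is the second argument, minPartitions, in the call to textFile (docs).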

    one giant .gz file will read in to a single partition.

Please note that you can always do a

sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)

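to get the desired number of partitions after the file has been read. Keep in mind that gzip is not splittable, so the read itself still runs as a single task; repartition only redistributes the data afterwards, at the cost of a shuffle.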
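The two approaches above (passing a partition hint to textFile, and repartitioning afterwards) can be combined in a small PySpark helper. This is only a sketch: the function name read_with_at_least is made up for illustration, and sc is assumed to be an existing SparkContext such as the one the pyspark shell provides.

```python
def read_with_at_least(sc, path, n):
    """Read a text file and make sure the result has at least n partitions.

    sc is assumed to be an existing SparkContext (hypothetical helper,
    not part of Spark itself).
    """
    # minPartitions is only a hint: a splittable file (.bz2) can honour it,
    # while a non-splittable file (.gz) still arrives as one partition.
    rdd = sc.textFile(path, minPartitions=n)

    # Fall back to a shuffle only when the hint could not be honoured.
    if rdd.getNumPartitions() < n:
        rdd = rdd.repartition(n)
    return rdd
```

For example, read_with_at_least(sc, "big.gz", 8).getNumPartitions() should report 8 after the shuffle, while for a large enough .bz2 file the repartition step is skipped.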

    Also, how do I know how many partitions it would be while Hadoop read in it from one bz2 file.

That would be yourRDD.partitions.size for the Scala API or yourRDD.getNumPartitions() for the Python API.
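If you want a rough idea of the count before actually reading the file, the split arithmetic can be sketched in plain Python. This follows my understanding of Hadoop's classic FileInputFormat.getSplits, which sc.textFile relies on. The function name estimate_splits and the 128 MiB default block size are assumptions for illustration; the real block size depends on your file system configuration, and all sizes are compressed bytes.

```python
SPLIT_SLOP = 1.1  # Hadoop lets the last split be up to 10% larger


def estimate_splits(file_size, min_partitions=2,
                    block_size=128 * 1024 * 1024,
                    min_split_size=1, splittable=True):
    """Estimate the number of input splits for one file of file_size bytes.

    Hypothetical helper; assumes file_size > 0.
    """
    if not splittable:
        # e.g. .gz: the whole file becomes a single split
        return 1

    # Target size per split if the file were divided evenly
    goal_size = file_size // max(min_partitions, 1)
    # Never exceed one block, never go below the minimum split size
    split_size = max(min_split_size, min(goal_size, block_size))

    remaining = file_size
    splits = 0
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the final, possibly slightly oversized, split
    return splits
```

With these assumptions, a 1 GiB .bz2 file and the default hint of 2 gives 8 splits (one per 128 MiB block), a hint of 16 gives 16, and a .gz file of any size gives 1. Treat the result as an estimate and confirm with getNumPartitions() on the actual RDD.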

score 4 · Answer 2 · edited Dec 10 '18 at 12:02


I didn't understand why my test program ran on only one executor. After some testing I think I have figured it out; this is what I ran:

With Spark (Scala API):

// Load a DataFrame of users. Each line in the file is a JSON
// document, representing one row.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val user = sqlContext.read.json("users.json.bz2")

answered Dec 10 '18 at 10:27 by 史荣琦 · edited Dec 10 '18 at 12:02 by EstevaoLuis
