I'm reading Jacek Laskowski's online book about Apache Spark, and regarding partitioning, he states that
By default, a partition is created for each HDFS partition, which by default is 64MB
I'm not extremely familiar with HDFS, but I ran into some questions while trying to verify this statement. I have a file called Reviews.csv, a roughly 330MB text file of Amazon food reviews. Given the default 64MB blocks, I'd expect ceiling(330 / 64) = 6
partitions. However, when I load the file into my Spark Shell, I get 9 partitions:
scala> val tokenized_logs = sc.textFile("Reviews.csv")
tokenized_logs: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> tokenized_logs
res0: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> tokenized_logs.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.HadoopPartition@3c1, org.apache.spark.rdd.HadoopPartition@3c2, org.apache.spark.rdd.HadoopPartition@3c3, org.apache.spark.rdd.HadoopPartition@3c4, org.apache.spark.rdd.HadoopPartition@3c5, org.apache.spark.rdd.HadoopPartition@3c6, org.apache.spark.rdd.HadoopPartition@3c7, org.apache.spark.rdd.HadoopPartition@3c8, org.apache.spark.rdd.HadoopPartition@3c9)
scala> tokenized_logs.partitions.size
res2: Int = 9
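For what it's worth, I know sc.textFile takes an optional second minPartitions argument, so I could request more partitions explicitly; what I'm trying to understand is where the default of 9 comes from. The call below is only to illustrate the knob I'm aware of (the 12 is an arbitrary number I picked for the example):

scala> // minPartitions is only a lower bound; Spark may still create more splits than this
scala> val forced = sc.textFile("Reviews.csv", 12)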
I do notice that if I create a smaller version of Reviews.csv called Reviews_Smaller.csv that is only 135MB, the number of partitions drops significantly:
scala> val raw_reviews = sc.textFile("Reviews_Smaller.csv")
raw_reviews: org.apache.spark.rdd.RDD[String] = Reviews_Smaller.csv MapPartitionsRDD[11] at textFile at <console>:24
scala> raw_reviews.partitions.size
res7: Int = 4
However, by my math, there should be ceiling(135 / 64) = 3 partitions, not 4.
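To make my arithmetic explicit, here is the back-of-the-envelope calculation I'm doing, with the 64MB block size being the assumption I'm least sure about:

scala> // assumed block size; this is the number I took from the book
scala> val blockSizeMB = 64.0
scala> def expectedPartitions(fileSizeMB: Double) = math.ceil(fileSizeMB / blockSizeMB).toInt
scala> expectedPartitions(330)   // I expect 6, but Spark reports 9
scala> expectedPartitions(135)   // I expect 3, but Spark reports 4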
I'm running everything locally, on my MacBook Pro. Can anyone help explain how the default number of partitions is calculated for HDFS?
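In case it's relevant, I assume the defaults below factor into the split calculation, though I haven't confirmed how they interact with the block size:

scala> sc.defaultParallelism
scala> sc.defaultMinPartitions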