
I am interested in finding out how Spark creates partitions when loading a file from the local file system.

I am using Databricks Community Edition to learn Spark. When I load a file that is just a few kilobytes in size (about 300 KB) using the sc.textFile command, Spark by default creates 2 partitions (as reported by partitions.length). When I load a file that is about 500 MB, it creates 8 partitions (which is equal to the number of cores on the machine).
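
Roughly what I am running (the paths below are just placeholders for my actual files):

val small = sc.textFile("file:///tmp/small_300kb.txt")  // ~300 KB file
println(small.partitions.length)                        // 2

val large = sc.textFile("file:///tmp/large_500mb.txt")  // ~500 MB file
println(large.partitions.length)                        // 8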


What is the logic here?

Also, I learnt from the documentation that if we are loading from the local file system and using a cluster, the file has to be at the same location on all the machines that belong to the cluster. Will this not create duplicates? How does Spark handle this scenario? If you can point me to articles that shed light on this, it would be of great help.

Thanks!

Sriram Rag

1 Answer


When Spark reads from the local file system, the default parallelism (defaultParallelism) is the number of available cores.
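
You can check these values on your own setup; a minimal sketch:

println(sc.defaultParallelism)    // on a local setup: the number of available cores (8 on your machine)
println(sc.defaultMinPartitions)  // math.min(defaultParallelism, 2) = 2, per the definition below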

sc.textFile, when no explicit partition count is passed, uses defaultMinPartitions, which is the minimum of defaultParallelism (the available cores in the local-FS case) and 2:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

Referred from: the Spark source code (SparkContext#defaultMinPartitions).

In the 1st case: file size ~300 KB

The number of partitions is calculated as 2, because the file is very small.

In the 2nd case: file size ~500 MB

The number of partitions is equal to defaultParallelism, which is 8 in your case.
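
If you need a different number of partitions than the default gives you, you can pass an explicit minimum to textFile. A sketch (the path is a placeholder):

val rdd = sc.textFile("file:///tmp/large_500mb.txt", 16)  // ask for at least 16 input partitions
println(rdd.partitions.length)                            // typically >= 16 when the file is splittable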

When reading from HDFS, sc.textFile takes the maximum of minPartitions and the number of splits computed by the Hadoop input format (roughly the total input size divided by the block size).
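
Based on my reading of Hadoop's FileInputFormat (the old mapred API that textFile goes through), the split count comes out of roughly the calculation below; this is an illustration of that formula, not the actual Hadoop source:

// Illustration of Hadoop's FileInputFormat split sizing (names follow the Hadoop code).
def numberOfSplits(totalSize: Long, numSplitsHint: Int, blockSize: Long, minSize: Long = 1L): Long = {
  val goalSize  = totalSize / math.max(numSplitsHint, 1)            // size needed to honour minPartitions
  val splitSize = math.max(minSize, math.min(goalSize, blockSize))  // a split never exceeds a block
  math.ceil(totalSize.toDouble / splitSize).toLong                  // approximate number of input splits
}

numberOfSplits(totalSize = 500L * 1024 * 1024, numSplitsHint = 2, blockSize = 128L * 1024 * 1024)
// ~4 splits with a 128 MB block size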

However, when using textFile with compressed files (e.g. file.txt.gz, as opposed to plain file.txt), Spark disables splitting, which yields an RDD with only 1 partition (reads against gzipped files cannot be parallelized).
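
A quick sketch of the usual workaround: read the gzipped file (you get a single partition) and then repartition it before doing any heavy work (the file name is a placeholder):

val gz = sc.textFile("file:///tmp/logs.txt.gz")  // gzip is not splittable
println(gz.partitions.length)                    // 1

val spread = gz.repartition(8)                   // shuffle the data across 8 partitions
println(spread.partitions.length)                // 8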

For your 2nd query, regarding reading data from a local path in a cluster:

Files need to be available on all the machines in the cluster, because Spark may launch executors on any machine in the cluster, and each executor reads the file through the file:// scheme.

To avoid copying the files to all the machines: if your data is already on a network file system such as NFS, AFS, or MapR's NFS layer, you can use it as input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on every node. Please refer to: https://community.hortonworks.com/questions/38482/loading-local-file-to-apache-spark.html
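
A sketch of what that looks like, assuming the share is mounted at /mnt/nfs on every node (the mount point and file name are placeholders):

// Works only if /mnt/nfs is mounted at the same path on the driver and on every executor node.
val data = sc.textFile("file:///mnt/nfs/datasets/input.txt")
println(data.partitions.length)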

Lakshman Battini
  • Awesome explanation – Anthati Nagaraju Jul 28 '18 at 16:12
  • Thanks @Lakshman-Battini for the explanation. Here is my doubt: based on Math.min(defaultParallelism, 2), in both cases (file sizes 300 KB and 500 MB), should the number of partitions not be 2, since that would be the minimum value? How is it 8 for the 500 MB file? As to the second question, I understand the files should be on all the machines that are part of the cluster, at the same path. Would this mean each executor loads the entire file? If each of them creates partitions, how will Spark know to avoid duplicates? – Sriram Rag Jul 29 '18 at 05:15
  • Here comes the splittability of the file. In Hadoop, if the file is splittable, Spark will launch parallel tasks to read each block (block size is 64 / 128 MB). Even in the case of a local FS, Spark is intelligent enough to launch the tasks to read the file in parallel if the file is splittable. – Lakshman Battini Jul 29 '18 at 05:32
  • For your 2nd query: each executor will not load the entire file; it will read only between the start and end offsets of its assigned block of the file. – Lakshman Battini Jul 29 '18 at 05:34