I am interested in finding out how Spark creates partitions when loading a file from the local file system.
I am using Databricks Community Edition to learn Spark. When I load a file that is only a few hundred kilobytes in size (about 300 KB) using `sc.textFile`, Spark creates 2 partitions by default (as reported by `partitions.length`). When I load a file that is about 500 MB, it creates 8 partitions (which is equal to the number of cores on the machine).
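For reference, this is roughly what I am running in the notebook (the file paths below are just placeholders for my local files, and the last call with an explicit `minPartitions` is only an experiment to see whether the count changes):

```scala
// `sc` (SparkContext) is already available in the Databricks notebook.

// ~300 KB file -> partitions.length reports 2 for me
val smallRdd = sc.textFile("file:///path/to/small-file.txt")
println(smallRdd.partitions.length)

// ~500 MB file -> partitions.length reports 8 for me (= number of cores)
val largeRdd = sc.textFile("file:///path/to/large-file.txt")
println(largeRdd.partitions.length)

// Passing minPartitions explicitly as the second argument
val forcedRdd = sc.textFile("file:///path/to/large-file.txt", 16)
println(forcedRdd.partitions.length)
```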
What is the logic here?
Also, I learnt from the documentation that if we load from the local file system while using a cluster, the file has to be present at the same path on every machine in the cluster. Won't this create duplicate copies of the data? How does Spark handle this scenario? If you can point me to articles that shed light on this, it would be a great help.
Thanks!