When writing a file to a Data Lake, specifically through Databricks, we have the option of specifying a partition column. This saves the data in separate folders (partitions), one for each distinct value in that column of the dataset.
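For concreteness, this is the kind of write I mean (a minimal PySpark sketch; the `color` column and the target path are made up for illustration):

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; this line is only for a standalone sketch.
spark = SparkSession.builder.getOrCreate()

# Hypothetical dataset with a 'color' column, used only for illustration.
df = spark.createDataFrame(
    [(1, "red"), (2, "blue"), (3, "red")],
    ["id", "color"],
)

# Writing with a partition column creates one folder per distinct value,
# e.g. .../colors/color=red/ and .../colors/color=blue/
df.write.partitionBy("color").mode("overwrite").parquet("/mnt/datalake/colors")
```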
At the same time, when we talk about Spark optimizations, we talk about partitioning the data.
What is the difference between these two? Is there a relationship between them?
From what I can understand, saving the data in the distributed file system in partitions will help when we want to read in only a certain portion of the data (based on the partition column, of course). For example, if we partition by color and we are only interested in the 'red' records, we can read in only that partition and ignore the rest. This gives us some level of optimization when reading the data.
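Something like this is what I have in mind (continuing the sketch above; the path is made up):

```python
from pyspark.sql.functions import col

# Because the data is partitioned by 'color' on disk, this filter should only
# scan the color=red folder (partition pruning) rather than the whole dataset.
red_df = spark.read.parquet("/mnt/datalake/colors").filter(col("color") == "red")

# The physical plan should show the pruning, e.g. a PartitionFilters entry on the file scan.
red_df.explain()
```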
Then, for Spark to perform parallel processing, this 'red' partition (on the file system) will be divided into Spark partitions based on the number of cores available in the cluster?
Is this correct? How does Spark decide the number of partitions? Is this number always equal to the number of cores in the cluster?
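To make the question concrete, this is the number I am asking about (again continuing the sketch; I am not sure which of these settings actually matter here):

```python
# Number of in-memory Spark partitions after reading the 'red' folder.
print(red_df.rdd.getNumPartitions())

# My understanding is that file-based reads are also influenced by settings
# like these, but I am not certain how they interact with the core count:
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
print(spark.conf.get("spark.sql.shuffle.partitions"))
```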
What is the idea of re-partitioning? I believe this involves the use of the `coalesce()` and `repartition()` functions. What causes Spark to re-partition data?
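For reference, this is the usage I am referring to (sketch only; the partition counts are arbitrary):

```python
# repartition() changes the number of partitions and involves a full shuffle.
repartitioned = red_df.repartition(8)

# coalesce() reduces the number of partitions and avoids a full shuffle.
coalesced = repartitioned.coalesce(2)

print(repartitioned.rdd.getNumPartitions(), coalesced.rdd.getNumPartitions())
```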