I have a 560 MB CSV file that I read from HDFS. When I checked the number of partitions using df.rdd.partitions.size, it showed 4 partitions. When I then ran df.count() to get the total row count,
a job was submitted with 2 stages and 5 tasks across all stages.
I need to understand why the total number of stages is 2 and why the total number of tasks across all stages is 5. I have learnt that there is one task per partition, so I expected 4 tasks.
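For reference, here is a minimal spark-shell (Scala) sketch of the steps above; the HDFS path is a placeholder, not my actual file:

```scala
// spark-shell sketch; replace the path with the real HDFS location
val df = spark.read
  .option("header", "true")
  .csv("hdfs:///data/file.csv")

// Number of input partitions (this reported 4 for my 560 MB file)
println(df.rdd.partitions.size)

// Triggers a job; the Spark UI then shows its stages and task counts
println(df.count())
```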
Thanks in advance.