I have a 560 MB CSV file that I read from HDFS. When I check the number of partitions using df.rdd.partitions.size, it shows 4 partitions. But if I just count the total rows using df.count(), a job is submitted with 2 stages and 5 tasks across those stages.

I need to understand why the total number of stages is 2 and the total number of tasks across all stages is 5. I have learnt that there is 1 task per partition, so I expected 4.
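Here is roughly what I ran (the HDFS path below is a placeholder):

    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///path/to/file.csv")  // ~560 MB CSV, placeholder path

    df.rdd.partitions.size              // returns 4
    df.count()                          // submits a job: 2 stages, 5 tasks total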

Thanks in advance.

Varadha31590

1 Answer


This is because count requires an additional stage. The first stage reads your input file with 4 partitions (= 4 tasks), and each task computes a local row count for its partition. The second stage has just one task, which reads the 4 partial counts over a shuffle and sums them. So the entire job has two stages with 5 tasks total.
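To make the mechanics concrete, here is a rough RDD-level sketch of what count() effectively does. This is an illustration of the stage layout, not Spark's actual internal code:

    // Stage 1: one task per partition (4 tasks); each task counts its
    // own rows and emits that single number as shuffle output.
    val localCounts = df.rdd.mapPartitions(rows => Iterator(rows.size.toLong))

    // repartition(1) forces a shuffle, i.e. the Stage 1 / Stage 2 boundary.
    // Stage 2: a single task reads the 4 partial counts and sums them.
    val total = localCounts
      .repartition(1)
      .mapPartitions(partials => Iterator(partials.sum))
      .first()

You can see the same boundary in the real query: df.explain() on a count shows a partial aggregate, then an Exchange (the shuffle), then the final aggregate, and in the Spark UI the second stage's single task shows a shuffle read.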

shay__