I have a Parquet directory containing 5 files, listed below:
I am using Spark 2.2 and reading this directory with the following code:
I am not clear why Spark determines 7 partitions (`alternateDF.rdd().getNumPartitions()`) when the directory has only 5 files, each smaller than the block size. In the Spark UI, 5 tasks show input records, but the last 2 tasks show 0 input records yet a non-zero input size. Could you please explain what each task is doing?