Input Data:
- a Hive table (T) stored as 35 SequenceFile files (~1.5 GB each)
- the files are in a GCS bucket
- fs.gs.block.size is left at its default (~128 MB)
- all other parameters are left at their defaults
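If the rule were one task per ~128 MB block (which is what I would expect, see Q2 below), a quick back-of-the-envelope estimate based only on the sizes above gives roughly 420 splits:

    // rough split estimate, assuming one split per 128 MB block (my assumption)
    val fileSizeMB    = 1.5 * 1024                          // ~1536 MB per file
    val blockSizeMB   = 128.0                               // default fs.gs.block.size
    val numFiles      = 35
    val blocksPerFile = math.ceil(fileSizeMB / blockSizeMB) // 12 blocks per file
    val expectedTasks = numFiles * blocksPerFile            // ~420 tasks
    println(s"expected tasks if one task per block: $expectedTasks")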
Experiment 1:
- create a Dataproc cluster with 2 workers (4 cores per worker)
- run select count(*) from T;
Experiment 1 Result:
- ~650 tasks were created to read the Hive table files
- each task read ~85 MB of data
Experiment 2:
- create a Dataproc cluster with 64 workers (4 cores per worker)
- run select count(*) from T;
Experiment 2 Result:
- ~24,480 tasks were created to read the Hive table files
- each task read ~2.5 MB of data (having one task read only ~2.5 MB seems like a bad idea, since the time spent opening the file is probably longer than the time spent reading 2.5 MB)
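To put the two results side by side (just arithmetic on the numbers above):

    // how per-task read size and task count change between the two experiments
    val totalMB   = 35 * 1.5 * 1024                               // ~53,760 MB in the table
    val exp1Tasks = 650.0
    val exp2Tasks = 24480.0
    println(f"exp1: ${totalMB / exp1Tasks}%.1f MB per task")      // ~83 MB, below the 128 MB block size
    println(f"exp2: ${totalMB / exp2Tasks}%.1f MB per task")      // ~2.2 MB per task
    println(f"tasks grew by ${exp2Tasks / exp1Tasks}%.1fx")       // ~38x
    println(f"cores grew by ${(64 * 4).toDouble / (2 * 4)}%.1fx") // 32x

So the task count appears to scale roughly with the number of cores in the cluster rather than with the number of blocks.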
Q1: Any idea how Spark determines the number of tasks used to read Hive table data files? I repeated the same experiments with the same data in HDFS and got similar results.
My understanding is that the number of tasks used to read Hive table files should equal the number of blocks in HDFS. Q2: Is that correct? Q3: Does that also hold when the data is in a GCS bucket instead of HDFS?
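In case it helps narrow things down, this is the kind of diagnostic I can run from spark-shell (where spark is predefined) and post here; the parameter names below are just my guesses at what might influence split sizing, not a confirmed list:

    // dump a few settings that might influence split sizing (names are guesses)
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    Seq("fs.gs.block.size",
        "mapreduce.input.fileinputformat.split.minsize",
        "mapreduce.input.fileinputformat.split.maxsize").foreach { k =>
      println(s"$k = ${hadoopConf.get(k)}")
    }
    println("spark.sql.files.maxPartitionBytes = " + spark.conf.get("spark.sql.files.maxPartitionBytes"))
    println(s"defaultParallelism = ${spark.sparkContext.defaultParallelism}")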
Thanks in advance!