Input Data:
- a Hive table (T) stored as 35 SequenceFile files (~1.5 GB each)
- the files are in a GCS bucket
- fs.gs.block.size is left at its default (~128 MB)
- all other parameters are left at their defaults
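If the rule were one task per ~128 MB block (which is what I would expect, see Q2 below), a quick back-of-the-envelope estimate based only on the sizes above gives roughly 420 splits:

    // rough split estimate, assuming one split per 128 MB block (my assumption)
    val fileSizeMB    = 1.5 * 1024                          // ~1536 MB per file
    val blockSizeMB   = 128.0                               // default fs.gs.block.size
    val numFiles      = 35
    val blocksPerFile = math.ceil(fileSizeMB / blockSizeMB) // 12 blocks per file
    val expectedTasks = numFiles * blocksPerFile            // ~420 tasks
    println(s"expected tasks if one task per block: $expectedTasks")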
Experiment 1:
- create a Dataproc cluster with 2 workers (4 cores per worker)
- run select count(*) from T;
Experiment 1 Result:
- ~650 tasks were created to read the Hive table files
- each task read ~85 MB of data
Experiment 2:
- create a Dataproc cluster with 64 workers (4 cores per worker)
- run select count(*) from T;
Experiment 2 Result:
- ~24,480 tasks were created to read the Hive table files
- each task read ~2.5 MB of data (having one task read only ~2.5 MB seems like a bad idea, since the time spent opening the file is probably longer than the time spent reading 2.5 MB)
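To put the two results side by side (just arithmetic on the numbers above):

    // how per-task read size and task count change between the two experiments
    val totalMB   = 35 * 1.5 * 1024                               // ~53,760 MB in the table
    val exp1Tasks = 650.0
    val exp2Tasks = 24480.0
    println(f"exp1: ${totalMB / exp1Tasks}%.1f MB per task")      // ~83 MB, below the 128 MB block size
    println(f"exp2: ${totalMB / exp2Tasks}%.1f MB per task")      // ~2.2 MB per task
    println(f"tasks grew by ${exp2Tasks / exp1Tasks}%.1fx")       // ~38x
    println(f"cores grew by ${(64 * 4).toDouble / (2 * 4)}%.1fx") // 32x

So the task count appears to scale roughly with the number of cores in the cluster rather than with the number of blocks.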
Q1: Any idea how Spark determines the number of tasks used to read Hive table data files? I repeated the same experiments with the same data in HDFS and got similar results.
My understanding is that the number of tasks used to read Hive table files should equal the number of blocks in HDFS. Q2: Is that correct? Q3: Does that also hold when the data is in a GCS bucket instead of HDFS?
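In case it helps narrow things down, this is the kind of diagnostic I can run from spark-shell (where spark is predefined) and post here; the parameter names below are just my guesses at what might influence split sizing, not a confirmed list:

    // dump a few settings that might influence split sizing (names are guesses)
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    Seq("fs.gs.block.size",
        "mapreduce.input.fileinputformat.split.minsize",
        "mapreduce.input.fileinputformat.split.maxsize").foreach { k =>
      println(s"$k = ${hadoopConf.get(k)}")
    }
    println("spark.sql.files.maxPartitionBytes = " + spark.conf.get("spark.sql.files.maxPartitionBytes"))
    println(s"defaultParallelism = ${spark.sparkContext.defaultParallelism}")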
Thanks in advance!