I noticed that in spark-shell (Spark 2.4.4), when I do a simple spark.read.format(xyz).load("a","b","c",...),
it looks like Spark uses a single IPC client (or "thread") to load the files a, b, c, ... sequentially (they are HDFS paths).
Is this expected?
The reason I am asking is that, in my case, I am trying to load 50K files, and the sequential load takes a long time.
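For concreteness, here is roughly what my call looks like in spark-shell; the format ("parquet"), the namenode address, and the path pattern below are placeholders, not my real ones:

```scala
// Build a large list of HDFS paths (placeholder names, ~50K in my real case)
val paths = (0 until 50000).map(i => s"hdfs://namenode:8020/data/file_$i")

// DataFrameReader.load accepts multiple paths via varargs.
// This is the call where the file listing appears to happen
// one path at a time over a single IPC client.
val df = spark.read
  .format("parquet")   // "xyz" in my description above
  .load(paths: _*)
```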
Thanks
PS: I am trying to trace this in the source code, but I am not sure if this is the relevant spot: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L180