
I noticed that in spark-shell (Spark 2.4.4), when I do a simple spark.read.format(xyz).load("a","b","c",...), Spark appears to use a single IPC client (or "thread") to load the files a, b, c, ... sequentially (they are paths on HDFS).
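
For reference, here is a minimal sketch of the call in question (the paths and the "parquet" format are hypothetical stand-ins for whatever the real job uses):

    // spark-shell, Spark 2.4.x: `spark` is the SparkSession the shell provides.
    // Paths and format below are hypothetical stand-ins.
    val paths = Seq("hdfs:///data/a", "hdfs:///data/b", "hdfs:///data/c")
    val df = spark.read.format("parquet").load(paths: _*) // varargs overload of load()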

Is this expected?

The reason I am asking is that, in my case, I am trying to load 50K files, and the sequential load takes a long time.

Thanks

PS: I am trying to find this in the source code, but I am not sure if this is the right place: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L180

– kcode2019
  • If you use spark-shell, by default it uses a single thread, so it will process all the files with that single thread. Try running spark-shell --master yarn, pass it some executors, and check whether it loads sequentially or in parallel. – Srinivas May 19 '20 at 06:55
  • I recall spark-shell by default uses all local CPUs (in my case, 48). I also tried --master yarn with various executor counts, and it is still single-threaded. My understanding is that since the "load" only reads file info (not file content), Spark somehow uses just one thread for that. – kcode2019 May 19 '20 at 07:10
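
Not something confirmed in this thread, but Spark 2.x does expose knobs for distributing the file listing itself as a Spark job; whether they kick in for an explicit list of 50K file paths is an assumption worth testing:

    // Hedged sketch: when the number of input paths exceeds the threshold below,
    // Spark 2.x is supposed to run the file listing as a distributed job
    // (InMemoryFileIndex) rather than sequentially on the driver.
    // Values shown are the defaults; tuning them for this case is speculative.
    spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")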

1 Answer


This might not be an exact answer to my original question, but I found the cause in my particular case: the name node's audit log showed some runaway jobs pegging the name node, which greatly slowed down its RPC calls. After killing those jobs, Spark's load speed improved dramatically.
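
A crude way to verify an improvement like this is to time just the load() call, since that is what triggers the driver-side file listing (and, for some formats, schema inference) before any data is read; the paths below are hypothetical:

    // Rough before/after check of the metadata phase only. No data is read here;
    // load() just lists the files and resolves the schema. Paths are hypothetical.
    val paths = (0 until 50000).map(i => s"hdfs:///data/part-$i")
    val t0 = System.nanoTime()
    val df = spark.read.format("parquet").load(paths: _*)
    println(f"load() took ${(System.nanoTime() - t0) / 1e9}%.1f s")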

– kcode2019