
Could someone tell me how to read files in parallel? I'm trying something like this:

def processFile(path):
  df = spark.read.json(path)
  return df.count()

paths = ["...", "..."]

distPaths = sc.parallelize(paths)
counts = distPaths.map(processFile).collect()
print(counts)

It fails with the following error:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Is there some other way to parallelize or optimize this?

Marat Faskhiev
  • Why not use threading? – Shubhanshu Mar 19 '20 at 12:33
  • @smx0 Could you point me to any docs? – Marat Faskhiev Mar 19 '20 at 12:36
  • You might want to check https://stackoverflow.com/questions/19322079/threading-in-python-for-each-item-in-list – Shubhanshu Mar 19 '20 at 13:07
  • Does this answer your question? [How to run independent transformations in parallel using PySpark?](https://stackoverflow.com/questions/38048068/how-to-run-independent-transformations-in-parallel-using-pyspark) – user10938362 Mar 19 '20 at 13:07
  • @user10938362 I'm sorry, I'm a bit new to this. It looks like in these examples the code will be invoked on the same machine. Is there a way to distribute the computation among the machines in a cluster? – Marat Faskhiev Mar 19 '20 at 13:24
  • @mazaneicha Thank you. Could you add an answer? I'll mark it as the answer. In my case it was enough: I grouped the sources by type and passed them to spark.read.json. – Marat Faskhiev Mar 20 '20 at 11:10
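
(For reference, a minimal sketch of the threading approach suggested in the comments above. It assumes a SparkSession named spark already exists on the driver; the helper name, the worker count, and the placeholder paths are illustrative, not from the question.)

from concurrent.futures import ThreadPoolExecutor

# The threads run on the driver, so referencing `spark` here is allowed.
# Each call still submits a normal distributed Spark job; the threads
# simply let several of those jobs be in flight at the same time.
def count_source(path):
    return spark.read.json(path).count()

paths = ["...", "..."]

with ThreadPoolExecutor(max_workers=4) as pool:  # worker count is illustrative
    counts = list(pool.map(count_source, paths))

print(counts)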

1 Answer


In your particular case, you can just pass the whole paths array to DataFrameReader:

df = spark.read.json(paths)

...and Spark will parallelize reading the individual files across the cluster.
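
For example, a short sketch of that approach, assuming all paths point to files with a compatible schema; the per-type grouping mentioned in the follow-up comment is shown with a hypothetical paths_by_type dict:

# One read over the whole list: Spark plans a single job whose tasks
# read the individual files in parallel across the cluster.
paths = ["...", "..."]
df = spark.read.json(paths)
print(df.count())

# If the sources have different schemas, one option (as in the follow-up
# comment) is to group the paths by type and issue one read per group.
# `paths_by_type` is a hypothetical {source_type: [paths]} dict.
counts = {t: spark.read.json(p).count() for t, p in paths_by_type.items()}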

mazaneicha