
Could someone tell me how to read files in parallel? I'm trying something like this:

def processFile(path):
  df = spark.read.json(path)
  return df.count()

paths = ["...", "..."]

distPaths = sc.parallelize(paths)
counts = distPaths.map(processFile).collect()
print(counts)

It fails with the following error:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Is there some other way to parallelize or optimize this?

Marat Faskhiev
  • Why not use threading? – Shubhanshu Mar 19 '20 at 12:33
  • @smx0 Could you point me to any docs? – Marat Faskhiev Mar 19 '20 at 12:36
  • You might want to check https://stackoverflow.com/questions/19322079/threading-in-python-for-each-item-in-list – Shubhanshu Mar 19 '20 at 13:07
  • Does this answer your question? [How to run independent transformations in parallel using PySpark?](https://stackoverflow.com/questions/38048068/how-to-run-independent-transformations-in-parallel-using-pyspark) – user10938362 Mar 19 '20 at 13:07
  • @user10938362 I'm sorry, I'm a bit new to this. It looks like in these examples the code will be invoked on the same machine. Is there a way to distribute the computation among the machines in a cluster? – Marat Faskhiev Mar 19 '20 at 13:24
  • @mazaneicha Thank you. Could you add an answer? I'll mark it as the answer. In my case it was enough: I grouped the sources by type and passed them to spark.read.json. – Marat Faskhiev Mar 20 '20 at 11:10
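
(For reference, a minimal sketch of the threading approach suggested in the comments above. It assumes a SparkSession named spark already exists on the driver; the helper name, the worker count, and the placeholder paths are illustrative, not from the question.)

from concurrent.futures import ThreadPoolExecutor

# The threads run on the driver, so referencing `spark` here is allowed.
# Each call still submits a normal distributed Spark job; the threads
# simply let several of those jobs be in flight at the same time.
def count_source(path):
    return spark.read.json(path).count()

paths = ["...", "..."]

with ThreadPoolExecutor(max_workers=4) as pool:  # worker count is illustrative
    counts = list(pool.map(count_source, paths))

print(counts)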

1 Answer


In your particular case, you can just pass the whole paths array to DataFrameReader:

df = spark.read.json(paths)

...and Spark will parallelize reading the individual files across the cluster.
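
For example, a short sketch of that approach, assuming all paths point to files with a compatible schema; the per-type grouping mentioned in the follow-up comment is shown with a hypothetical paths_by_type dict:

# One read over the whole list: Spark plans a single job whose tasks
# read the individual files in parallel across the cluster.
paths = ["...", "..."]
df = spark.read.json(paths)
print(df.count())

# If the sources have different schemas, one option (as in the follow-up
# comment) is to group the paths by type and issue one read per group.
# `paths_by_type` is a hypothetical {source_type: [paths]} dict.
counts = {t: spark.read.json(p).count() for t, p in paths_by_type.items()}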

mazaneicha