Could someone tell me how to read multiple JSON files in parallel with PySpark? I'm trying something like this:
def processFile(path):
    df = spark.read.json(path)
    return df.count()

paths = ["...", "..."]
distPaths = sc.parallelize(paths)
counts = distPaths.map(processFile).collect()
print(counts)
It fails with the following error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
From what I understand, the map closure captures the spark session, which only exists on the driver, so spark.read can't be called inside an RDD transformation. Is there another way to parallelize reads like this?
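In case it helps frame the question, here are two workarounds I'm considering. The first keeps the reads on the driver and issues them concurrently from a thread pool, so each thread submits its own Spark job. A minimal sketch, assuming spark is an existing SparkSession (the name process_file and max_workers=4 are just placeholders I picked):

from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # Runs on the driver, where the SparkSession is available;
    # the read and count themselves still execute as normal distributed jobs.
    return spark.read.json(path).count()

paths = ["...", "..."]
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(process_file, paths))
print(counts)

The second passes all the paths to a single spark.read.json call and recovers per-file counts with input_file_name(), roughly:

from pyspark.sql.functions import input_file_name

# One DataFrame over every file; tag each row with the file it came from,
# then count rows per file in a single distributed job.
df = spark.read.json(paths).withColumn("source_file", input_file_name())
counts = df.groupBy("source_file").count().collect()
print(counts)

Would either of these be idiomatic, or is there a better pattern?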