
I am new to pyspark and a little confused about how to use it. I have a directory structure as follows:

Main Directory

  • dir1 --> file1.csv, file2.csv, ...
  • dir2 --> file1.csv, file2.csv, ...
  • dir3 --> file1.csv, file2.csv, ...

I want to read and process these csv files with parallel execution using SQLContext in pyspark. What I was trying to do was map the directory names (dir1, dir2, ...) and call a worker function to process the csv files inside that particular directory, roughly as sketched below. But it turns out I cannot access the SQLContext inside the worker function to read the csv files with a proper schema using pyspark.
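
A minimal sketch of what I tried is below (the paths, app name, and worker function are simplified placeholders, and it assumes Spark 2.x where `sqlContext.read.csv` is available). The `map` step is where it breaks, since the contexts only exist on the driver:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="read_csv_dirs")
sqlContext = SQLContext(sc)

def process_dir(dir_name):
    # This is the part that does not work: sqlContext lives on the driver
    # and cannot be serialized and shipped to the executors running this task.
    df = sqlContext.read.csv("Main Directory/" + dir_name,
                             header=True, inferSchema=True)
    return (dir_name, df.count())

dirs = ["dir1", "dir2", "dir3"]
# Fails at runtime because the worker function closes over sqlContext/sc
results = sc.parallelize(dirs).map(process_dir).collect()
```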

Is there any solution to this problem? Or is there another, more efficient approach I can take to do the same thing? I see that there are some answered questions about independent executions in pyspark, like: How to run independent transformations in parallel using PySpark?

But my problem is that I want to read the csv files using SQLContext inside each worker function. Any help is appreciated.
