
I am new to pyspark and a little confused about how to use it. I have a directory structure as follows:

Main Directory

  • dir1 --> file1.csv, file2.csv, ...
  • dir2 --> file1.csv, file2.csv, ...
  • dir3 --> file1.csv, file2.csv, ...

I want to read and process these csv files with parallel execution using SQLContext in pyspark. What I was trying to do was map the directory names (dir1, dir2, ...) and call a worker function to process the csv files inside that particular directory, roughly as sketched below. But it turns out I cannot access the SQLContext inside the worker function to read the csv files with a proper schema using pyspark.
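
A minimal sketch of what I tried is below (the paths, app name, and worker function are simplified placeholders, and it assumes Spark 2.x where `sqlContext.read.csv` is available). The `map` step is where it breaks, since the contexts only exist on the driver:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="read_csv_dirs")
sqlContext = SQLContext(sc)

def process_dir(dir_name):
    # This is the part that does not work: sqlContext lives on the driver
    # and cannot be serialized and shipped to the executors running this task.
    df = sqlContext.read.csv("Main Directory/" + dir_name,
                             header=True, inferSchema=True)
    return (dir_name, df.count())

dirs = ["dir1", "dir2", "dir3"]
# Fails at runtime because the worker function closes over sqlContext/sc
results = sc.parallelize(dirs).map(process_dir).collect()
```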

Is there any solution to this problem? Or is there another, more efficient approach I can take to do the same thing? I see that there are some answered questions about independent executions in pyspark, like: How to run independent transformations in parallel using PySpark?

But my problem is that I want to read the csv files using SQLContext inside each worker function. Any help is appreciated.
