I have two folders, A and B. Folder A contains file1.csv.gz and file2.csv.gz, and folder B contains file2.csv.gz and file3.csv.gz. I would like to read all of these files into a single DataFrame. This is what I am doing:
folders_to_read = ["A/*.csv.gz", "B/*.csv.gz"]

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("mode", "DROPMALFORMED") \
    .load(i for i in folders_to_read)
But I get the following error:
Py4JJavaError: An error occurred while calling o200.load.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
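I suspect the problem is how I am passing the paths to load(). Would passing the list itself, rather than a generator expression, be the correct way to read both folders at once? Something like the sketch below is what I had in mind, but I am not sure whether load() actually accepts a list of paths, so this is just my assumption:

folders_to_read = ["A/*.csv.gz", "B/*.csv.gz"]

# Pass the list of glob patterns directly instead of a generator expression
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("mode", "DROPMALFORMED") \
    .load(folders_to_read)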