
In Amazon S3 I have a folder with around 30 subfolders; each subfolder contains one CSV file.

I want a simple way to read each CSV file from all the subfolders. Currently I can do this by specifying the path n times, but I feel there must be a more concise way.

e.g. dataframe = sqlContext.read.csv([path1, path2, path3, etc.], header=True)

Tim496

1 Answer


Just use a wildcard * in the path; this also assumes each CSV has the same set of columns. Emulating your situation like this (using Jupyter magic commands so you can see the folder structure):

! ls sub_csv/
print("="*10)
! ls sub_csv/csv1/
! ls sub_csv/csv2/
! ls sub_csv/csv3/
print("="*10)
! cat sub_csv/csv1/*.csv
! cat sub_csv/csv2/*.csv
! cat sub_csv/csv3/*.csv

csv1
csv2
csv3
==========
csv1.csv
csv2.csv
csv3.csv
==========
id
1
id
2
id
3

spark\
.read\
.option("header", "true")\
.csv("sub_csv/*")\
.show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
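
The same wildcard should carry over to S3. A minimal sketch, assuming a hypothetical bucket and prefix (my-bucket/data, holding the ~30 subfolders) and a cluster already configured for the s3a:// connector:

spark\
.read\
.option("header", "true")\
.csv("s3a://my-bucket/data/*")\
.show()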
thePurplePython
  • I get the error message `IllegalArgumentException: u'java.net.URISyntaxException: Relative path in absolute URI: 2019-03-25T16:25:47.330010'` - any idea why? – Tim496 May 03 '19 at 09:58
  • Not sure without seeing the command ... why are you using sqlContext and not sparkSession? (see the sketch below) – thePurplePython May 03 '19 at 20:31
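
For reference, a minimal sketch of the SparkSession-based read the last comment hints at (Spark 2.x and later), reusing the local sub_csv/* path from the answer above:

from pyspark.sql import SparkSession

# getOrCreate() returns the notebook's existing session, or builds a new one
spark = SparkSession.builder.getOrCreate()

spark\
.read\
.option("header", "true")\
.csv("sub_csv/*")\
.show()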