
In Spark 2.0 I can combine several file paths into a single load (see, e.g., How to import multiple csv files in a single load?).

How can I achieve this with sparklyr's spark_read_csv?

Deepdelusion

2 Answers


It turns out that wildcard usage in file paths is the same in sparklyr as in SparkR, so many folders can be combined into a single call.

Deepdelusion

Code example to read several numbered CSV files from all subfolders of a specific folder on HDFS:

spark_read_csv(sc, path = "hdfs:///folder/subfolder_*/file[0-9].csv")

Note that depending on the size of the resulting object, you may want to set the parameter memory = FALSE.

Danilo Saft
  • Is it possible to define a numerical range for the subfolders? For example, my subfolders are 2-digit numbers, each indicating a day of the month {01, 02, 03, ..., 31}. – Brad Feb 22 '21 at 01:44
  • I believe this is possible using character classes like the [0-9] part in the path string above. Note that Spark resolves these paths with Hadoop glob patterns (*, ?, [...], {a,b}), not full regular expressions, so for two-digit day folders something like subfolder_[0-3][0-9] should work. You can build and test a pattern, then use it in the path parameter of the code posted above. – Danilo Saft Feb 19 '22 at 10:22
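As a way to sanity-check such a pattern before handing it to spark_read_csv, note that R's built-in Sys.glob uses the same *, ?, and [...] wildcard syntax as Hadoop path globs. The sketch below (folder names day_01, day_15, day_31 and file1.csv are hypothetical, created in a temp directory purely for illustration) shows a character-class pattern matching two-digit day subfolders locally:

```r
# Create a hypothetical layout: folder/day_XX/file1.csv for a few days
base <- file.path(tempdir(), "folder")
for (day in sprintf("%02d", c(1, 15, 31))) {
  d <- file.path(base, paste0("day_", day))
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, "file1.csv"))
}

# [0-3][0-9] matches any two-digit folder suffix 00-39, which covers 01-31;
# the same pattern string can then be used in spark_read_csv's path argument
matches <- Sys.glob(file.path(base, "day_[0-3][0-9]", "file[0-9].csv"))
length(matches)  # 3
```

Validating the glob locally this way is cheap; on a real cluster the pattern is resolved against HDFS by Spark, so only the path prefix (hdfs:///...) changes.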