
In Spark 2.0 I can combine several file paths into a single load (see, e.g., How to import multiple csv files in a single load?).

How can I achieve this with sparklyr's spark_read_csv?

Deepdelusion

2 Answers


It turns out that wildcard usage in file paths is the same in sparklyr as in SparkR, so many folders can be combined into a single call.

Deepdelusion

Code example to read several numbered CSV files from all subfolders of a specific folder on HDFS:

spark_read_csv(sc, path = "hdfs:///folder/subfolder_*/file[0-9].csv")

Note that depending on the size of the resulting object, you may want to set the parameter memory = FALSE.

Danilo Saft
  • Is it possible to define a numerical range for the subfolders? For example, my subfolders are 2-digit numbers, each indicating a day of the month {01, 02, 03, ..., 31}. – Brad Feb 22 '21 at 01:44
  • I believe this is possible using character classes like the [0-9] part in the path string above. Note that Spark resolves these paths with Hadoop glob patterns (*, ?, [...], {a,b}), not full regular expressions, so for two-digit day folders something like subfolder_[0-3][0-9] should work. You can build and test a pattern, then use it in the path parameter of the code posted above. – Danilo Saft Feb 19 '22 at 10:22
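As a way to sanity-check such a pattern before handing it to spark_read_csv, note that R's built-in Sys.glob uses the same *, ?, and [...] wildcard syntax as Hadoop path globs. The sketch below (folder names day_01, day_15, day_31 and file1.csv are hypothetical, created in a temp directory purely for illustration) shows a character-class pattern matching two-digit day subfolders locally:

```r
# Create a hypothetical layout: folder/day_XX/file1.csv for a few days
base <- file.path(tempdir(), "folder")
for (day in sprintf("%02d", c(1, 15, 31))) {
  d <- file.path(base, paste0("day_", day))
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, "file1.csv"))
}

# [0-3][0-9] matches any two-digit folder suffix 00-39, which covers 01-31;
# the same pattern string can then be used in spark_read_csv's path argument
matches <- Sys.glob(file.path(base, "day_[0-3][0-9]", "file[0-9].csv"))
length(matches)  # 3
```

Validating the glob locally this way is cheap; on a real cluster the pattern is resolved against HDFS by Spark, so only the path prefix (hdfs:///...) changes.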