In Spark 2.0 I can combine several file paths into a single load (see e.g. How to import multiple csv files in a single load?).
How can I achieve this with sparklyr's spark_read_csv?
It turns out that wildcards in file paths work the same way in sparklyr as in SparkR, so many folders can be combined into a single call.
Code example to read several numbered CSV files from all subfolders of a specific folder on HDFS:
spark_read_csv(sc, path = "hdfs:///folder/subfolder_*/file[0-9].csv")
Note that, depending on the size of the resulting object, you may want to set the parameter memory = FALSE so that the data is not cached in Spark memory when the table is read.
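
For completeness, here is a minimal sketch that combines the wildcard path with memory = FALSE. The master URL and the table name combined_csvs are assumptions for illustration; adjust them to your cluster and naming:

library(sparklyr)

sc <- spark_connect(master = "yarn")  # assumption: replace with your cluster's master URL

# Read all matching CSV files into one Spark DataFrame without caching it in memory
combined <- spark_read_csv(
  sc,
  name   = "combined_csvs",
  path   = "hdfs:///folder/subfolder_*/file[0-9].csv",
  memory = FALSE
)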