I need to load a CSV file that is 500 GB in size.
spark.read.csv("file.csv.gz")
takes hours. Is there a way to speed this up?
If the file is gzipped, then this is expected: gzip files aren't splittable, so the whole file is read by a single core. You have two options:
- decompress the file so it becomes a plain CSV file; it will then be splittable and can be processed in parallel (see the sketch after this list)
- try to use this custom code, but I'm not sure that it will work on Databricks without changes.
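Here is a minimal PySpark sketch of the first option, assuming the file lives on DBFS under a hypothetical /dbfs/data/ path and that the column names are placeholders. The decompression step is still single-threaded, but it only runs once; after that Spark can split the plain CSV across many tasks. Supplying an explicit schema also avoids a second full pass over 500 GB for schema inference.

```python
# A minimal sketch of option 1: decompress once, then read the plain CSV in parallel.
# The /dbfs/data/ paths and the column definitions are placeholders - adjust to your setup.
import gzip
import shutil

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# 1. Decompress the gzipped file (single-threaded, but done only one time).
with gzip.open("/dbfs/data/file.csv.gz", "rb") as src, \
        open("/dbfs/data/file.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)

# 2. Read the uncompressed CSV; Spark can now split it across many tasks.
#    An explicit schema avoids an extra full scan of the data for inference.
schema = StructType([
    StructField("col1", StringType(), True),  # placeholder columns
    StructField("col2", StringType(), True),
])

df = spark.read.csv("dbfs:/data/file.csv", header=True, schema=schema)
df.count()  # forces the read; the work is now distributed across the cluster
```

If the data is reloaded regularly, it may also be worth writing it out once to a splittable, compressed format (e.g. Parquet) so the decompression cost is paid only once.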