I need to load a CSV file that is 500 GB in size.
spark.read.csv("file.csv.gz")
takes hours. Is there a way to speed this up?
If the file is gzipped, then this is expected: gzip files aren't splittable, so the whole file is read by a single core. You have two options:
- decompress the file so it becomes a plain CSV file; it will then be splittable and can be processed in parallel (see the sketch after this list)
- try to use this custom code, but I'm not sure that it will work on Databricks without changes.
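Here is a minimal PySpark sketch of the first option, assuming the file lives on DBFS under a hypothetical /dbfs/data/ path and that the column names are placeholders. The decompression step is still single-threaded, but it only runs once; after that Spark can split the plain CSV across many tasks. Supplying an explicit schema also avoids a second full pass over 500 GB for schema inference.

```python
# A minimal sketch of option 1: decompress once, then read the plain CSV in parallel.
# The /dbfs/data/ paths and the column definitions are placeholders - adjust to your setup.
import gzip
import shutil

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# 1. Decompress the gzipped file (single-threaded, but done only one time).
with gzip.open("/dbfs/data/file.csv.gz", "rb") as src, \
        open("/dbfs/data/file.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)

# 2. Read the uncompressed CSV; Spark can now split it across many tasks.
#    An explicit schema avoids an extra full scan of the data for inference.
schema = StructType([
    StructField("col1", StringType(), True),  # placeholder columns
    StructField("col2", StringType(), True),
])

df = spark.read.csv("dbfs:/data/file.csv", header=True, schema=schema)
df.count()  # forces the read; the work is now distributed across the cluster
```

If the data is reloaded regularly, it may also be worth writing it out once to a splittable, compressed format (e.g. Parquet) so the decompression cost is paid only once.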