
I need to load a CSV file that is 500GB in size.

spark.read.csv("file.csv.gz") 

takes hours. Is there a solution to speed that up?


1 Answer


If you have gzipped files, then this is expected: gzip files aren't splittable, so each one is processed by a single core. You have a choice of:

  • decompress the file so it becomes a plain CSV - then it will be splittable and can be processed in parallel (see the sketch after this list)

  • try to use this custom code, but I'm not sure that it will work on Databricks without changes.
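Below is a minimal sketch of the first option (decompress once, then let Spark split the plain CSV), assuming PySpark and a file path reachable by a shell command; the paths, schema, and column names are placeholders, not from the original question.

# Sketch of option 1: decompress once, then read the plain CSV in parallel.
# Paths, schema, and column names here are placeholders, not from the question.
import subprocess
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Decompress the gzip archive once (single-threaded, but only done one time).
# "gunzip -k" keeps the original .gz file; drop -k if disk space is tight.
subprocess.run(["gunzip", "-k", "/data/file.csv.gz"], check=True)

# Give Spark an explicit schema so it doesn't need an extra pass over a
# 500GB file just to infer column types.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

# The uncompressed CSV is splittable, so Spark reads it with many tasks.
df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/data/file.csv"))

print(df.rdd.getNumPartitions())  # should now be well above 1

Once the file is uncompressed, Spark splits it into roughly 128 MB input partitions by default (spark.sql.files.maxPartitionBytes), so the read runs across many tasks instead of one.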
