I would like to save a gzip file as a Hive table in Databricks, via the PySpark commands below:
df = spark.read.csv(".../Papers.txt.gz", sep="\t")  # read the tab-separated, gzipped file
df.write.saveAsTable("...")                         # persist it as a Hive table
The gzip file Papers.txt.gz weighs about 60 GB when unzipped (it is a single large .txt file, actually taken from here), and the Spark cluster is fairly large (850 GB of memory, 112 cores).
The problem is that it takes a very long time to save as a table (over 20 minutes), which made me abort the operation for fear of bringing the cluster down.
The request seems pretty standard, but is there something I should be careful about here?
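My suspicion is that, since gzip is not a splittable format, Spark reads the whole file into a single partition on one core, so the write also runs as a single task. A repartition before the write might help; this is just a sketch, and the partition count of 200 is an arbitrary guess on my part:

df = spark.read.csv(".../Papers.txt.gz", sep="\t")
# gzip is not splittable, so the whole file lands in one partition;
# spreading the rows out first should parallelize the write
# (200 is an arbitrary count, not tuned for this cluster)
df = df.repartition(200)
df.write.saveAsTable("...")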
Thank you in advance.