I have about 40 GB of gzipped TSV files stored on S3.
I load them with
df = spark.read.csv()
and write the DataFrame to HDFS with
df.write.parquet()
The resulting Parquet output is about 20 GB.
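For context, here is a minimal sketch of that pipeline; the paths, separator, and header option are placeholders for what I actually use:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tsv-to-parquet").getOrCreate()

# Read the gzipped TSV files from S3 (placeholder path and options).
df = spark.read.csv(
    "s3a://my-bucket/data/*.tsv.gz",
    sep="\t",
    header=True,
)

# Write the DataFrame to HDFS as Parquet (placeholder path).
df.write.parquet("hdfs:///user/me/data_parquet")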
But if I call repartition
on the DataFrame before writing it, the output size increases by about 10x:
df.repartition(num)
df.write.parquet()
Even if I call repartition
with an argument equal to the existing number of partitions, the output size still increases a lot.
This makes the operation extremely slow.
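For reference, the slow variant looks roughly like this; num is a placeholder for the partition count I pass in:

# repartition() performs a full shuffle of the data before the write.
num = 200  # placeholder value
df_repart = df.repartition(num)
df_repart.write.parquet("hdfs:///user/me/data_parquet_repart")  # placeholder path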
But I do need the repartition
step, because spark.read.csv
doesn't return a reasonably partitioned DataFrame.
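To show what I mean about the partitioning, this is how I check it right after the read:

# Gzip is not splittable, so Spark reads each .gz file as a single
# partition; a few large files means very few, very large partitions.
print(df.rdd.getNumPartitions())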
Does anyone know about this issue?