
There are 40 GB of gzipped TSV files stored on S3.

I load them using

df = spark.read.csv()

and store the DataFrame to HDFS with

df.write.parquet()

The resulting size after that is 20 GB.
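Roughly, the full pipeline looks like this (the paths, separator, and header settings below are placeholders, not my exact values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tsv-to-parquet").getOrCreate()

# Read the gzipped TSV files from S3 (placeholder path and options)
df = spark.read.csv("s3a://my-bucket/data/*.tsv.gz", sep="\t", header=False)

# Write the DataFrame to HDFS as Parquet (placeholder path)
df.write.parquet("hdfs:///user/me/data_parquet")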

But if I call repartition on the DataFrame before writing it, the data size increases by about 10x:

df.repartition(num)
df.write.parquet()

Even if I use repartition and pass an argument equal to the existing number of partitions, the data size still increases a lot.

This makes the operation extremely slow.

But I do need the repartition step, because spark.read.csv doesn't return a reasonably partitioned DataFrame.
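For reference, a quick sketch of how I compare partition counts (num is a placeholder value, not my real setting):

# df is the DataFrame loaded with spark.read.csv above; num is a placeholder
num = 200  # hypothetical target partition count

print(df.rdd.getNumPartitions())                   # partitions produced by the CSV read
print(df.repartition(num).rdd.getNumPartitions())  # equals num after the shuffle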

Does anyone know about this issue?

    GZip is not a splittable codec in Hadoop. That's why you don't get a lot of partitions when you read the CSV. I don't think there's much you can do here apart from changing the type of your file or probably extracting it before you read it using Spark. – philantrovert Nov 24 '17 at 10:15
  • You need to review your partition size when writing in Parquet! This is quite broad to answer. – eliasah Nov 24 '17 at 11:00
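A minimal sketch of what the first comment suggests (moving the data to a splittable format rather than repartitioning the gzip-backed DataFrame); the paths are placeholders and this is not a verified fix:

# Write Parquet once without repartitioning, then read it back; Parquet is
# splittable, so the second read yields many partitions without a shuffle.
df = spark.read.csv("s3a://my-bucket/data/*.tsv.gz", sep="\t")
df.write.parquet("hdfs:///user/me/data_parquet")

parquet_df = spark.read.parquet("hdfs:///user/me/data_parquet")
print(parquet_df.rdd.getNumPartitions())  # driven by Parquet file/block sizes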

0 Answers