0

I have written a DataFrame to a Parquet file on HDFS using Spark. The output has 100 subdirectories (each subdirectory contains one file) and its total size is about 100 GB.

When I repartition the DataFrame to 10 partitions and write it to HDFS, the size of the output Parquet files increases to about 200 GB. Why does this happen, and what is the optimum number of partitions when writing to a Parquet file?

My question is different from this question and I don't think it's a duplicate. That question may answer the first part of my question (why does this happen?), but my main question is: what is the optimum number of partitions when writing to a Parquet file?

david_js
  • 43
  • 8
  • Possible duplicate of [Why are Spark Parquet files for an aggregate larger than the original?](https://stackoverflow.com/questions/38153935/why-are-spark-parquet-files-for-an-aggregate-larger-than-the-original) – user10938362 Jun 17 '19 at 07:38
  • It's not a duplicate. That question may answer the first part of my question (why does this happen?), but my main question is: what is the optimum number of partitions when writing to a Parquet file? – david_js Jun 17 '19 at 07:49

1 Answer

2

It all comes down to use, and it comes in two flavors: is there a logical identifier in my data that will consistently be searched on, or do I just care about file efficiency?

(1) Logical identifier: if your data has a column (or columns) that is queried consistently (i.e. transaction time or input time), you can partition along those lines. This lets your process skip irrelevant data quickly, allowing faster query times. The downside to partitioning is that going over about 2K partitions is known to break technologies like Impala, so don't go too crazy.
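
For example, a minimal sketch of partitioning by a logical identifier (not from the original answer; it reuses the answer's inputDF, and the column name transactionDate and the output path are assumptions). A date-granularity column is typically chosen here so the number of partition directories stays well below the ~2K limit mentioned above:

// Write the data partitioned by a logical identifier so readers can prune
// whole directories instead of scanning every file. "transactionDate" and
// the output path are placeholders for illustration only.
inputDF.write
  .partitionBy("transactionDate")
  .mode("overwrite")
  .parquet("hdfs:///path/to/output_by_transaction_date")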

(2) Size partitioning: if you are just optimizing file size for movement around the environment and for other services/tools, I would advise aiming for about 128 MB of data per partition. This allows quicker movement and avoids the issues other tools (i.e. AWS S3) can have when processing a large series of small files. Below is some code for setting your partitions based on estimated data size.

import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// Rough estimate of the DataFrame's in-memory size in bytes.
val estimatedSizeBytes: Long = SizeEstimator.estimate(inputDF.rdd)
// Find an appropriate number of partitions, targeting ~128 MB (134217728 bytes) each.
val numPartitions: Long = (estimatedSizeBytes / 134217728L) + 1
// Repartition so that each output file is roughly one HDFS block in size.
val outputDF: DataFrame = inputDF.repartition(numPartitions.toInt)
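
For completeness (not part of the original answer), the repartitioned DataFrame can then be written back out; the output path below is a placeholder:

// Write the repartitioned DataFrame back to HDFS as Parquet.
// The path is illustrative, not from the original post.
outputDF.write.mode("overwrite").parquet("hdfs:///path/to/output_resized")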

Without knowing your data I cannot tell you whether it would be better to partition by a logical identifier, by byte size, or by a combination of both. I hope I gave you enough information to help you figure out what you want to do.

afeldman
  • 492
  • 2
  • 10
  • Thank you for the explanation, but you didn't explain how a size of 128 MB per partition of the Parquet file could minimize the final volume of the file. – david_js Jun 19 '19 at 05:20
  • Data Volume: Parquet uses columnar compression, so the output data volume is directly correlated with how well the file is organized: the higher the organization, the lower the data volume, and vice versa. In essence you want to separate the data by a logical identifier, using partitionBy or bucketBy [Spark 2+] (a sketch follows these comments). The one issue is that outside services will struggle when trying to access all of the resulting files (a great example of this is AWS S3). – afeldman Jun 19 '19 at 16:47
  • Block Size: The other solution is using the block size as a pseudo-ceiling for partition size. The reason is that if your processing needs to read the full file, other services (i.e. AWS, SQL services, ...) are generally more efficient when processing files at the 128 MB block size. This will not reduce the data volume, but it will help run time. – afeldman Jun 19 '19 at 16:47
  • Short answer: to reduce data volume, run partitionBy or bucketBy on a partition column. To efficiently build the data for outside services, set a logical memory-size ceiling for the partitions. To answer your question about the perfect number of partitions: which are you more concerned about, output data volume or processing/outside services? – afeldman Jun 19 '19 at 16:48
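
A hedged sketch of the bucketBy approach mentioned in the comments above (not from the thread itself). The bucket count, column name, and table name are assumptions; note that bucketBy must be combined with saveAsTable rather than a plain path-based write:

// Bucket the data by a logical identifier; bucketBy [Spark 2+] requires
// saveAsTable and cannot be used with a plain .parquet(path) write.
inputDF.write
  .bucketBy(16, "transactionId")   // bucket count and column are illustrative
  .sortBy("transactionId")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("events_bucketed")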