
I saved my dataframe in Parquet format:

df.write.parquet('/my/path')

When checking on HDFS, I can see that there are 10 part-xxx.snappy.parquet files under the Parquet directory /my/path.

My question is: does each part-xxx.snappy.parquet file correspond to a partition of my dataframe?

super1ha1
  • I am not sure if this question might be a duplicate; please let me know if there is already a similar question. – super1ha1 Mar 30 '20 at 02:30

2 Answers


Yes, part-* files are created based on the number of partitions in the dataframe at the time of writing to HDFS.

To check the number of partitions in the dataframe:

df.rdd.getNumPartitions()

To control the number of files written to the filesystem, we can use .repartition() or .coalesce() to adjust the partition count before writing, depending on our requirements.

notNull

Yes, this creates one file per Spark partition.

Note that you can also partition the files by some attribute:

df.write.partitionBy("key").parquet("/my/path")

In that case Spark will create up to one file per Spark partition within each Parquet partition directory, i.e. up to (number of Spark partitions) × (number of distinct key values) files in total. A common way to reduce the number of files is to repartition the data by the same key before writing (this effectively creates one file per Parquet partition).

bottaio