
I saved my dataframe in Parquet format:

df.write.parquet('/my/path')

When checking on HDFS, I can see that there are 10 part-xxx.snappy.parquet files under the Parquet directory /my/path.

My question is: does each part-xxx.snappy.parquet file correspond to a partition of my dataframe?

super1ha1
  • I am not sure if this question might be a duplicate; please let me know if there is already a similar question. – super1ha1 Mar 30 '20 at 02:30

2 Answers


Yes, part-* files are created based on the number of partitions in the dataframe at the time of writing to HDFS.

To check the number of partitions in the dataframe:

df.rdd.getNumPartitions()

To control the number of files written to the filesystem, we can use .repartition() or .coalesce() to adjust the partition count before writing, depending on our requirements.

notNull

Yes, this creates one file per Spark partition.

Note that you can also partition the files by some attribute:

df.write.partitionBy("key").parquet("/my/path")

In that case Spark will create up to one file per Spark partition within each Parquet partition directory, i.e. up to (number of Spark partitions) × (number of distinct key values) files in total. A common way to reduce the number of files is to repartition the data by the same key before writing (this effectively creates one file per Parquet partition).

bottaio