
I've finally been introduced to Parquet and am trying to understand it better. I realize that when running Spark it is best to have at least as many Parquet files (partitions) as you have cores to utilize Spark to its fullest. However, are there any advantages/disadvantages to storing the data in one large Parquet file vs. several smaller Parquet files?

As a test I'm using this dataset:
https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-01.parquet

This is the code I'm testing with:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()
df = spark.read.parquet('fhvhv_tripdata_2021-01.parquet')
df.write.parquet('test.parquet')
df.write.parquet('./test')

When I run ls -lh on the files, I see that the test.parquet file is 4.0K,

and the two files created by writing to a directory are 2.5K and 189M.

When I read these back into different dataframes they have the same count.
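
A quick way to see why the directory contains two part files is to check how many partitions Spark is using; a minimal sketch, assuming the same SparkSession and paths as above:

# Spark writes roughly one part file per non-empty partition
# (plus _SUCCESS and .crc bookkeeping files).
print(df.rdd.getNumPartitions())

# Reading the directory back gives the same row count as the original file.
df_back = spark.read.parquet('./test')
print(df_back.count())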


When is it best practice to do one over the other? What is the best practice for balancing the file sizes when writing to a directory, and should you even try? Any guidance/rules of thumb for writing/reading Parquet files is greatly appreciated.

Michael
  • I believe test.parquet is a directory containing files inside, so please check that first. Also, while this varies from case to case, as you mentioned the number of files should roughly equal the number of cores. The reason we cannot have too many small files is that it makes reads slower, while only a few large files makes parallelization harder, so you need to balance between the two. – Anjaneya Tripathi Jun 03 '22 at 14:33
  • You are correct. The .parquet file is simply a directory. Looking closer, it also contains .crc files with the metadata either way. In both cases, one of the "real" .parquet files is 393M and the other is 2.6K. Is there a best practice to even out the data in each file? – Michael Jun 03 '22 at 20:11
  • In Spark you can use repartition to break the data into nearly equal chunks. As suggested in the Databricks training, you can pick the number of cores and use that number to repartition your file, since the default shuffle partition count is 200, which is a bit high unless a lot of data is present. – Anjaneya Tripathi Jun 03 '22 at 20:30
  • @AnjaneyaTripathi Do you want to put that as an answer and I will accept it? – Michael Jun 11 '22 at 13:07

1 Answer


In Spark you can use repartition to break the data into nearly equal chunks. As suggested in the Databricks training, you can pick the number of cores and use that number to repartition your file, since the default shuffle partition count is 200, which is a bit high unless a lot of data is present.
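
For example, a minimal sketch of that approach in PySpark (the input path matches the question; the app name and output path are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('repartition-test') \
    .getOrCreate()

# Use one partition per available core instead of the default 200 shuffle partitions.
num_cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", num_cores)

df = spark.read.parquet('fhvhv_tripdata_2021-01.parquet')

# repartition() shuffles the data into nearly equal chunks, one per core,
# so the part files written out are roughly the same size.
df.repartition(num_cores).write.mode('overwrite').parquet('./test_repartitioned')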

One specific gotcha with repartition is when your DataFrame has complex data types whose values vary widely in size; for that case you can refer to this question on Stack Overflow.

Anjaneya Tripathi
  • Just to clarify: if you write SomeFileName.parquet, it is actually a directory which contains the Parquet files. – Michael Jun 13 '22 at 13:05
  • That is the case with any format in Spark, not just Parquet. Since Spark is a distributed system, it always writes the data in part files, and we cannot control the final file names; if you specify a file name, it creates a directory with that name. – Anjaneya Tripathi Jun 13 '22 at 13:06
  • Can we do something about that? Can we directly write parquet files rather than a directory? – lil-wolf Jul 08 '22 at 13:29
  • No, we cannot write directly to a file instead of a directory. Instead, we can produce a single part file using repartition(1)/coalesce(1) and then pull that file out of the directory; see the sketch below, and there is example code [here](https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv). – Anjaneya Tripathi Jul 08 '22 at 15:11
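
A minimal sketch of that repartition(1)/coalesce(1) pattern, assuming the DataFrame df from the question and a local filesystem (the temporary directory and final file names are placeholders; on HDFS/S3 you would move the file with the corresponding filesystem API instead of shutil):

import glob
import shutil

# Write everything into a single part file inside a temporary directory.
df.coalesce(1).write.mode('overwrite').parquet('./tmp_single')

# Pull the lone part file out, rename it, and remove the temporary directory.
part_file = glob.glob('./tmp_single/part-*.parquet')[0]
shutil.move(part_file, './fhvhv_tripdata_single.parquet')
shutil.rmtree('./tmp_single')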