2

This is most of a theoretical query per se, but directly linked to how I should create my files in HDFS. So, please bear with me for a bit.

I'm recently stuck on creating Dataframes for a set of data stored in parquet (snappy) files sitting on HDFS. Each parquet file is approximately 250+ MB in size but the total number of files are around 6k. Which I see as the reason of creating around 10K tasks while creating the DF & obviously runs longer than expected.

I went through some posts where the explanation of the optimal parquet file size to be 1G minimum has been given (https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html), (Is it better to have one large parquet file or lots of smaller parquet files?).

I wanted to understand how Spark's processing is affected by the size of the files it is reading. More so, does HDFS block size & the file size greater or lesser than HDFS block size literally affects how spark partitions get created? If yes, then how; I need to understand the granular level details. If anyone has any specific & precise links to the context I'm asking of, it'd be a great help in understanding.

knowone
  • 840
  • 2
  • 16
  • 37

0 Answers0