This is mostly a theoretical question, but it is directly linked to how I should create my files in HDFS, so please bear with me for a bit.
I'm currently stuck on creating DataFrames for a set of data stored in Parquet (Snappy) files sitting on HDFS. Each Parquet file is approximately 250+ MB in size, but the total number of files is around 6k. I see this as the reason why around 10K tasks get created while building the DataFrame, which obviously runs longer than expected.
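For reference, this is roughly how I'm reading the data and where I see the task count; the path and app name below are placeholders, not my actual values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-partition-check")   // placeholder app name
  .getOrCreate()

// ~6k snappy-compressed Parquet files, ~250 MB each (placeholder path)
val df = spark.read.parquet("hdfs:///data/my_dataset")

// This is where the ~10K figure comes from: one task per input partition
println(s"Input partitions: ${df.rdd.getNumPartitions}")
```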
I went through some posts that recommend an optimal Parquet file size of at least 1 GB (https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html, Is it better to have one large parquet file or lots of smaller parquet files?).
I want to understand how Spark's processing is affected by the size of the files it reads. More specifically, do the HDFS block size, and whether a file is larger or smaller than that block size, actually affect how Spark partitions are created? If yes, then how? I need to understand the details at a granular level. If anyone has specific and precise links on exactly this topic, it would be a great help in understanding.
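For context, these are the settings I've been experimenting with while trying to influence the partitioning. I'm not sure whether these are even the right knobs relative to the HDFS block size, so the values below are just the documented defaults, not recommendations:

```scala
// Settings I've been toying with (values shown are the defaults, only for illustration)
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L)  // 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", 4194304L)      // 4 MB

// Checking what's currently in effect
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))
println(spark.sparkContext.defaultParallelism)
```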