
I use Spark to transfer data from MongoDB to HDFS, partitioning it by a field so that records are stored in different folders by that field. I'm trying to understand whether I should specify "maxRecordsPerFile" (or split the output some other way) so each folder gets several files, or whether I should just write one big file per folder. I'm aware of the HDFS block concept and that HDFS will split a big file into blocks on its own. What I'd like to know is whether there is any difference between reading 1 huge file and 1000 smaller (but still considerably bigger than block size) files. Code example:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

dataset
  .withColumn(YEAR_COLUMN, year(col(DATE_COLUMN)))   // derive partition columns from the date
  .withColumn(MONTH_COLUMN, month(col(DATE_COLUMN)))
  .write
  //.option("maxRecordsPerFile", 100000) // or some other number to make files around 1 GB
  .mode(SaveMode.Append)
  .partitionBy(YEAR_COLUMN, MONTH_COLUMN)            // one folder per year/month value
  .json(OUTPUT_PATH)
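
To make the commented-out option concrete, here is a rough sketch of how I imagine picking a value for "maxRecordsPerFile": estimate the average serialized record size from a small sample and derive a per-file record cap from it. The sample fraction and the 1 GB target are just assumptions for illustration, not numbers I've tuned:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// Approximate the serialized JSON size of a row on a small sample,
// then work out how many records would land near the target file size.
val avgBytesPerRecord = dataset
  .sample(0.001)                                        // assumed sample fraction
  .select(avg(length(to_json(struct(col("*"))))))
  .first()
  .getDouble(0)

val targetFileBytes = 1024L * 1024 * 1024               // ~1 GB target, a guess
val recordsPerFile  = (targetFileBytes / avgBytesPerRecord).toLong

dataset
  .withColumn(YEAR_COLUMN, year(col(DATE_COLUMN)))
  .withColumn(MONTH_COLUMN, month(col(DATE_COLUMN)))
  .write
  .option("maxRecordsPerFile", recordsPerFile)          // cap each output file by record count
  .mode(SaveMode.Append)
  .partitionBy(YEAR_COLUMN, MONTH_COLUMN)
  .json(OUTPUT_PATH)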
  • Does this answer your question? [Is it better to have one large parquet file or lots of smaller parquet files?](https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files) – suj1th Nov 09 '20 at 15:27
  • Hello @suj1th, not exactly. That question is about the difference between files around block size and files around 1 GB (which is better). The first answer just says to aim at 1 GB and talks about compression. The second answer explains why 1 GB is better for parquet than smaller files. But what would be the downside of, for example, a 10 GB parquet file compared to a 1 GB one? Correct me if I'm wrong, but I couldn't see that in the answers. – Nikita Poberezkin Nov 10 '20 at 07:18