I use Spark to transfer data from Mongo to HDFS, partitioning it by some field so that it lands in a different folder per value of that field. I'm trying to understand whether I should set "maxRecordsPerFile" (or split the output in some other way) so that each folder gets several files, or whether writing one big file per folder is fine.

I'm aware of the HDFS block concept and that HDFS will split a big file into blocks and so on. What I'd like to know is whether there is any difference between reading 1 huge file and 1000 not-so-huge files (each still considerably bigger than the block size).

Code example:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

dataset
  .withColumn(YEAR_COLUMN, year(col(DATE_COLUMN)))   // derive the partition columns from the date field
  .withColumn(MONTH_COLUMN, month(col(DATE_COLUMN)))
  .write
  //.option("maxRecordsPerFile", 100000) // or some other number to make files around 1 GB
  .mode(SaveMode.Append)
  .partitionBy(YEAR_COLUMN, MONTH_COLUMN)            // one folder per (year, month)
  .json(OUTPUT_PATH)
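
In case it helps, this is the kind of change I'm considering, as a rough sketch only: the 1 GB target, the 10,000-row sample, and the repartition call are my own guesses for sizing files, not something I've measured or validated.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// Rough sketch: estimate how many records fit into ~1 GB by sampling the JSON size
// of a small slice, then cap each output file at that many records.
val targetFileBytes = 1024L * 1024 * 1024
val sampleSizes = dataset.limit(10000).toJSON.rdd.map(_.length.toLong).cache()
val avgRecordBytes = math.max(1L, sampleSizes.reduce(_ + _) / sampleSizes.count())
val recordsPerFile = math.max(1L, targetFileBytes / avgRecordBytes)

dataset
  .withColumn(YEAR_COLUMN, year(col(DATE_COLUMN)))
  .withColumn(MONTH_COLUMN, month(col(DATE_COLUMN)))
  .repartition(col(YEAR_COLUMN), col(MONTH_COLUMN))  // group each (year, month) into fewer, larger tasks
  .write
  .option("maxRecordsPerFile", recordsPerFile)       // split any output file above the estimated cap
  .mode(SaveMode.Append)
  .partitionBy(YEAR_COLUMN, MONTH_COLUMN)
  .json(OUTPUT_PATH)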