Hope everyone is doing well.
I am trying to read a large number of JSON files, around 150,000, from a folder using Azure Databricks. Each file contains a single JSON document, i.e. one record per file. Currently the read alone takes over an hour despite running on a large cluster. The files are read using a glob pattern as shown below.
import org.apache.spark.sql.functions.input_file_name

val schema_variable = <schema>  // explicit schema, so Spark does not have to infer it per file
val file_path = "src_folder/year/month/day/hour/*/*.json"
// e.g. src_folder/2022/09/01/10/*/*.json

val df = spark.read
  .schema(schema_variable)
  .json(file_path)
  .withColumn("file_name", input_file_name())  // keep track of which file each record came from
Is there any approach or option we can try to make the read faster?
We have already considered copying the file contents into a single file and then reading it, but then we lose the lineage of the file content, i.e. which record came from which file.
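For clarity, the consolidation idea looks roughly like the sketch below (the consolidated_folder path and the merge step are placeholders for however the copy would be done); the comment marks where the lineage is lost:

val consolidatedDf = spark.read
  .schema(schema_variable)
  .json("consolidated_folder/2022/09/01/10/merged.json")  // placeholder: files merged beforehand by an external copy job
  .withColumn("file_name", input_file_name())

// file_name now holds "consolidated_folder/.../merged.json" for every row,
// so we can no longer tell which of the original ~150,000 files a record came from.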
I have also gone through various links on SO, but most of them deal with one or a few very large files, say 10 GB to 50 GB, rather than many small ones.
Environment: Azure Databricks, Databricks Runtime 10.4.
Thank you for all the help.