
Hope everyone is doing well.

I am trying to read a large number of JSON files (around 150,000 files in a folder) using Azure Databricks. Each file contains a single JSON document, i.e. one record per file. Currently it takes over an hour just to read all the files, despite a large cluster. The files are read using the pattern shown below.

import org.apache.spark.sql.functions.input_file_name

val schema_variable = <schema> // predefined schema, omitted here
val file_path = "src_folder/year/month/day/hour/*/*.json"
// e.g. src_folder/2022/09/01/10/*/*.json

val df = spark.read
  .schema(schema_variable)
  .json(file_path)
  .withColumn("file_name", input_file_name()) // record which file each row came from

Is there any approach or option we can try to make the reads faster?

We have already considered copying the file contents into a single file and then reading that instead, but then we lose the lineage of the records, i.e. which record came from which file.
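For illustration only, here is a minimal sketch of how lineage could be kept while still consolidating: read the small JSON files once, capture the source path with input_file_name(), and write the result out as a small number of larger files so that later reads no longer touch 150,000 tiny JSON files. The consolidated_folder path, the Parquet format, and the coalesce target of 16 are all assumptions, not something from the original setup.

import org.apache.spark.sql.functions.input_file_name

// One-time (slow) read of the small files, keeping per-record lineage.
val consolidated = spark.read
  .schema(schema_variable)
  .json("src_folder/2022/09/01/10/*/*.json")
  .withColumn("file_name", input_file_name()) // source file kept as a column

// Write into a handful of larger files; subsequent reads hit these instead
// of the original 150,000 small JSON files. Target file count is arbitrary.
consolidated
  .coalesce(16)
  .write
  .mode("overwrite")
  .parquet("consolidated_folder/2022/09/01/10") // hypothetical output path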

I have also gone through various links on SO, but most of them deal with a single file (or a few files) of huge size, say 10 GB to 50 GB, rather than a very large number of tiny files.

Environment: Azure Databricks Runtime 10.4.

Thank you for all the help.

rainingdistros
  • Have you tried reading in parallel (i.e. passing all the files in an array, rather than one at a time)? https://stackoverflow.com/a/60776993/361842 – JohnLBevan Sep 29 '22 at 14:31
  • 1
    @JohnLBevan, thank you for your response. When I mentioned `.json(file_path)`, the file_path value is passed as `.json("src_folder/year/month/day/hour/*/*.json")` I was hoping that spark would consider the file pattern and read in parallel. Is there any other alternative ? e.g. .json("src_folder/2022/09/01/10/*/*.json") – rainingdistros Sep 29 '22 at 14:37
  • Ah sorry, I missed the wildcards... I'm not sure if that's parallel or not (can't find anything in the docs), but would assume the same as you. – JohnLBevan Sep 29 '22 at 16:29
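Purely as a sketch of the suggestion in the comments (and with no claim that it is actually faster than wildcard expansion), the files could be listed up front via the Hadoop FileSystem API and passed to a single json() call as an explicit list of paths. The variable names here are illustrative.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions.input_file_name

// Resolve the wildcard pattern to a concrete list of file paths.
val pattern = new Path("src_folder/2022/09/01/10/*/*.json")
val fs = pattern.getFileSystem(spark.sparkContext.hadoopConfiguration)
val filePaths = fs.globStatus(pattern).map(_.getPath.toString)

// Pass all resolved paths to a single read.
val dfFromList = spark.read
  .schema(schema_variable)
  .json(filePaths: _*)
  .withColumn("file_name", input_file_name())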

0 Answers