
Hope everyone is doing well.

I am trying to read a large number of JSON files (around 150,000 files in a folder) using Azure Databricks. Each file contains a single JSON document, i.e. one record per file. Currently it takes over an hour just to read all the files, despite a large cluster. The files are read using the pattern shown below.

import org.apache.spark.sql.functions.input_file_name

val schema_variable = <schema> // predefined schema, omitted here
val file_path = "src_folder/year/month/day/hour/*/*.json"
// e.g. src_folder/2022/09/01/10/*/*.json

val df = spark.read
  .schema(schema_variable)
  .json(file_path)
  .withColumn("file_name", input_file_name()) // record which file each row came from

Is there any approach or option we can try to make the reads faster?

We have already considered copying the file contents into a single file and then reading that instead, but then we lose the lineage of the records, i.e. which record came from which file.
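For illustration only, here is a minimal sketch of how lineage could be kept while still consolidating: read the small JSON files once, capture the source path with input_file_name(), and write the result out as a small number of larger files so that later reads no longer touch 150,000 tiny JSON files. The consolidated_folder path, the Parquet format, and the coalesce target of 16 are all assumptions, not something from the original setup.

import org.apache.spark.sql.functions.input_file_name

// One-time (slow) read of the small files, keeping per-record lineage.
val consolidated = spark.read
  .schema(schema_variable)
  .json("src_folder/2022/09/01/10/*/*.json")
  .withColumn("file_name", input_file_name()) // source file kept as a column

// Write into a handful of larger files; subsequent reads hit these instead
// of the original 150,000 small JSON files. Target file count is arbitrary.
consolidated
  .coalesce(16)
  .write
  .mode("overwrite")
  .parquet("consolidated_folder/2022/09/01/10") // hypothetical output path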

I have also gone through various links on SO, but most of them deal with a single file (or a few files) of huge size, say 10 GB to 50 GB, rather than a very large number of tiny files.

Environment: Azure Databricks Runtime 10.4.

Thank you for all the help.

rainingdistros
  • Have you tried reading in parallel (i.e. passing all the files in an array, rather than one at a time)? https://stackoverflow.com/a/60776993/361842 – JohnLBevan Sep 29 '22 at 14:31
  • 1
    @JohnLBevan, thank you for your response. When I mentioned `.json(file_path)`, the file_path value is passed as `.json("src_folder/year/month/day/hour/*/*.json")` I was hoping that spark would consider the file pattern and read in parallel. Is there any other alternative ? e.g. .json("src_folder/2022/09/01/10/*/*.json") – rainingdistros Sep 29 '22 at 14:37
  • Ah sorry, I missed the wildcards... I'm not sure if that's parallel or not (can't find anything in the docs), but would assume the same as you. – JohnLBevan Sep 29 '22 at 16:29
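Purely as a sketch of the suggestion in the comments (and with no claim that it is actually faster than wildcard expansion), the files could be listed up front via the Hadoop FileSystem API and passed to a single json() call as an explicit list of paths. The variable names here are illustrative.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions.input_file_name

// Resolve the wildcard pattern to a concrete list of file paths.
val pattern = new Path("src_folder/2022/09/01/10/*/*.json")
val fs = pattern.getFileSystem(spark.sparkContext.hadoopConfiguration)
val filePaths = fs.globStatus(pattern).map(_.getPath.toString)

// Pass all resolved paths to a single read.
val dfFromList = spark.read
  .schema(schema_variable)
  .json(filePaths: _*)
  .withColumn("file_name", input_file_name())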

0 Answers