
How does Spark `readStream` code get triggered in Databricks Auto Loader?

I understand it is an event-driven process: a new-file notification causes the file to be consumed.

Should the code below be run as a job? If so, how are notifications useful?

What happens when the code below is executed? What is the sequence of steps that takes place when files are processed using the notification mechanism in Databricks?

Once I run the code below, the command completes in about 2 minutes.

df = spark.readStream.format("cloudFiles")\
  .option("cloudFiles.useNotifications", True)\
  .option("cloudFiles.format", "csv")\
  .option("cloudFiles.connectionString", connection_string)\
  .option("cloudFiles.resourceGroup", resource_group)\
  .option("cloudFiles.subscriptionId", subscription_id)\
  .option("cloudFiles.tenantId", tenant_id)\
  .option("cloudFiles.clientId", client_id)\
  .option("cloudFiles.clientSecret", secret)\
  .option("cloudFiles.region", region)\
  .option("header", True)\
  .schema(dataset_schema)\
  .option("cloudFiles.includeExistingFiles", True)\
  .load(file_location)

1 Answer


Databricks Auto Loader can work in two modes:

  1. File listing mode - Auto Loader lists the files in cloud storage and compares them with the list of previously processed files to find which files are new. If you have a lot of files, the listing time can be very significant, especially on ADLS, where each response returns at most 5,000 files.

  2. File notification mode - the underlying storage emits an event with the file name whenever a file is created or modified, and this event is queued for consumption (the queue implementation is cloud specific). To find new files to ingest, Auto Loader only needs to read the queue; the Spark process then reads just those files, without having to list the directory.
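The difference between the two modes comes down to the `cloudFiles.useNotifications` option. A minimal sketch of both, where `spark` is the SparkSession that Databricks provides, and the paths, schema, and queue options are hypothetical placeholders:

```python
# Sketch only: `spark`, `path`, `schema`, and `queue_opts` are placeholders;
# on Databricks, `spark` is the session the runtime provides.

def read_with_listing(spark, path, schema):
    """Mode 1: directory listing -- Auto Loader lists the path on each
    micro-batch and diffs it against files it has already processed."""
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            # No useNotifications option: directory listing is the default.
            .schema(schema)
            .load(path))

def read_with_notifications(spark, path, schema, queue_opts):
    """Mode 2: file notifications -- Auto Loader consumes file-created
    events from a cloud queue instead of listing the directory."""
    reader = (spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "csv")
              .option("cloudFiles.useNotifications", True))
    # queue_opts holds the cloud-specific settings from the question,
    # e.g. cloudFiles.connectionString, cloudFiles.tenantId, ...
    for key, value in queue_opts.items():
        reader = reader.option(key, value)
    return reader.schema(schema).load(path)
```

With `cloudFiles.includeExistingFiles` set to `True` (the default), notification mode still ingests files that were already present before the queue was set up.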

Please note that a file notification by itself doesn't trigger the job - actual data processing happens only when the job runs (either periodically or continuously). There is, however, upcoming functionality that will make it possible to trigger a job when new files land in the monitored location.
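This also explains why the question's `readStream` call "completes in 2 minutes": it only defines the streaming source. Nothing is actually read until a `writeStream` with a trigger starts the query. A sketch of the two ways to run it, where `df` is the streaming DataFrame from the question and the checkpoint path and table name are hypothetical placeholders:

```python
# Sketch only: `df` is the Auto Loader streaming DataFrame from the
# question; `checkpoint` and `target` are hypothetical placeholders.

def start_scheduled_batch(df, checkpoint, target):
    """Run as a scheduled job: drain everything queued since the last
    run, then stop (availableNow needs Spark 3.3+ / recent DBR)."""
    return (df.writeStream
            .option("checkpointLocation", checkpoint)
            .trigger(availableNow=True)
            .toTable(target))

def start_continuous(df, checkpoint, target):
    """Run continuously: check the notification queue every 30 seconds
    and process whatever new files have been announced."""
    return (df.writeStream
            .option("checkpointLocation", checkpoint)
            .trigger(processingTime="30 seconds")
            .toTable(target))
```

Either way, the notifications only make *finding* new files cheap; it is the running (or scheduled) query that consumes them.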

Alex Ott