
We have a Databricks job that has suddenly started failing consistently. Sometimes it runs for an hour, other times it fails after a few minutes.

The inner exception is

ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.

The job runs a notebook that consumes from an Event Hub with PySpark Structured Streaming, calculates some values based on the data, and streams the results back to another Event Hub.
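A simplified sketch of what the notebook does; the secret scope, message schema, aggregation, and storage paths below are illustrative placeholders, not our actual code:

```python
from pyspark.sql.functions import avg, col, from_json, struct, to_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Connection strings come from a (hypothetical) secret scope; the Event Hubs
# connector expects them encrypted via EventHubsUtils.
input_conn = dbutils.secrets.get("my-scope", "input-eventhub-connection")
output_conn = dbutils.secrets.get("my-scope", "output-eventhub-connection")
encrypt = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt
input_conf = {"eventhubs.connectionString": encrypt(input_conn)}
output_conf = {"eventhubs.connectionString": encrypt(output_conn)}

# Placeholder message schema.
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("value", DoubleType()),
])

# Read the raw events and parse the JSON body.
raw = spark.readStream.format("eventhubs").options(**input_conf).load()
parsed = (raw
          .select(from_json(col("body").cast("string"), schema).alias("data"))
          .select("data.*"))

# Placeholder aggregation standing in for the real calculation.
result = parsed.groupBy("deviceId").agg(avg("value").alias("avgValue"))

# Stream the results to the output Event Hub; the checkpoint lived on a
# standard storage account when the failures started (path is illustrative).
query = (result
         .select(to_json(struct("*")).alias("body"))
         .writeStream
         .format("eventhubs")
         .options(**output_conf)
         .outputMode("update")
         .option("checkpointLocation",
                 "wasbs://checkpoints@mystandardacct.blob.core.windows.net/eventhub-job/")
         .start())
```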

The cluster uses an instance pool with 2 workers and 1 driver, running standard Databricks Runtime 9.1 ML.

We've tried restarting the job many times, also with clean input data and a fresh checkpoint location (how we reset the checkpoint is sketched below). We're struggling to determine what is causing this error. We cannot see any 403 Forbidden errors in the logs, which is sometimes mentioned on forums as a cause. Any assistance is greatly appreciated.
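For reference, this is roughly how we reset the checkpoint between attempts (the path is illustrative, matching the sketch above):

```python
# Illustrative checkpoint path -- the same one the writeStream uses.
checkpoint_path = "wasbs://checkpoints@mystandardacct.blob.core.windows.net/eventhub-job/"

# Stop any active streams so nothing keeps writing to the checkpoint,
# then delete it recursively before the clean restart.
for s in spark.streams.active:
    s.stop()
dbutils.fs.rm(checkpoint_path, True)
```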

  • Are your two storage accounts, the source and destination of the copy, *premium*? Refer to this: https://stackoverflow.com/questions/36364529/could-not-verify-the-copy-source-within-the-specified-time-requestid-blank – Jim Todd Dec 16 '21 at 14:51

1 Answer


Issue resolved by moving the checkpoint location (used internally by Spark) from standard storage to premium storage. I don't know why it suddenly started failing after months of running with hardly a hiccup. Premium storage may be a better place for checkpointing anyway, since I/O transactions are cheaper there.
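A minimal sketch of the change, reusing the names from the sketch in the question; the premium storage account name is hypothetical and must already be configured for access from the cluster:

```python
# Checkpoint now lives on a premium storage account (illustrative name).
premium_checkpoint = "wasbs://checkpoints@mypremiumacct.blob.core.windows.net/eventhub-job/"

query = (result
         .select(to_json(struct("*")).alias("body"))
         .writeStream
         .format("eventhubs")
         .options(**output_conf)
         .outputMode("update")
         .option("checkpointLocation", premium_checkpoint)  # moved off standard storage
         .start())
```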