
We want to listen to an Azure Event Hub and write the data to a Delta table in Azure Databricks. We have created the following code:

df = spark.readStream.format("eventhubs").options(**ehConf).load()
# Code omitted where message content is expanded into columns in the dataframe
df.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/tmp/delta/events/_checkpoints/") \
  .toTable("mydb.mytable")

This code works perfectly, and the notebook stays running on the df.writeStream row until the job is canceled.

How should I set this up so that it runs "eternally" and ideally restarts if the code crashes? Should I run it as a normal workflow, e.g. scheduled every minute but with Max concurrent runs set to 1?

Mathias Rönnlund

1 Answer


You don't need to schedule the job every minute - you'll just get a lot of error runs that fill your UI. Instead, configure the job with Max concurrent runs set to 1, no schedule, and Max retries (and maybe a retry interval) set to a large number, so that when the job fails, the Workflows manager restarts it automatically.
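
A minimal sketch of such a job definition, submitted through the Jobs API 2.1 with the Python requests library. The workspace URL, token, notebook path, and cluster ID are hypothetical placeholders; per the Databricks docs, max_retries = -1 means "retry indefinitely".

import requests

# Hypothetical values - replace with your workspace URL, token, notebook and cluster
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXX"

job_settings = {
    "name": "eventhubs-to-delta-stream",
    "max_concurrent_runs": 1,  # never run two copies of the stream at once
    # no "schedule" block: the job is started once and kept alive by retries
    "tasks": [
        {
            "task_key": "stream",
            "notebook_task": {"notebook_path": "/Repos/me/eventhubs_to_delta"},
            "existing_cluster_id": "0123-456789-abcdefgh",
            "max_retries": -1,                    # retry forever if the stream fails
            "min_retry_interval_millis": 60000,   # wait 1 minute between retries
            "retry_on_timeout": True,
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_settings,
)
resp.raise_for_status()
print("Created job", resp.json()["job_id"])

After creating the job, trigger it once with "Run now"; with no schedule and unlimited retries it will keep the stream running and restart it whenever it fails.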

P.S. Instead of using the Spark connector for Event Hubs, consider using the built-in Kafka connector - it's more performant and stable. This answer is about DLT, but it really works for normal code as well.
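
A sketch of reading the same Event Hub through its Kafka-compatible endpoint (port 9093), assuming the namespace, Event Hub name, secret scope, and target table are placeholders for your own values; on Databricks runtimes the Kafka client is shaded, hence the kafkashaded prefix in the JAAS config.

import pyspark.sql.functions as F

# Hypothetical values - replace with your namespace, Event Hub and secret scope
EH_NAMESPACE = "my-namespace"
EH_NAME = "my-eventhub"  # the Event Hub name is used as the Kafka topic
EH_CONN_STR = dbutils.secrets.get("my-scope", "eh-connection-string")

kafka_options = {
    "kafka.bootstrap.servers": f"{EH_NAMESPACE}.servicebus.windows.net:9093",
    "subscribe": EH_NAME,
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    # Event Hubs accepts the literal user name "$ConnectionString" with the
    # connection string as the password
    "kafka.sasl.jaas.config": (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{EH_CONN_STR}";'
    ),
    "startingOffsets": "latest",
}

df = (spark.readStream
      .format("kafka")
      .options(**kafka_options)
      .load()
      # the Kafka "value" column carries the same payload as the EventHubs body
      .withColumn("body", F.col("value").cast("string")))

(df.writeStream
   .format("delta")
   .outputMode("append")
   .option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
   .toTable("mydb.mytable"))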

Alex Ott