How exactly can one create an efficient and reusable Databricks workflow for dumping a raw SQL database into Delta Lake? The main points of confusion are the best approach to the following:

  • Handling schema drift (columns added to or removed from the source DB tables): is a naive overwrite of the stored tables good enough? (See the sketch after this list.)
  • Change Data Capture (CDC) and merging into existing data efficiently, say on an ID column. Is this still relevant when the source is a relational database?
  • Is Delta Live Tables (DLT) suitable for this?

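For the schema-drift point specifically, Delta Lake exposes a few write options. The following is only a minimal sketch: `df` is a hypothetical DataFrame freshly read from the source over JDBC, and `bronze.customers` is an illustrative target table name, not anything defined above.

# `df` is a hypothetical DataFrame read from the source; the target name is illustrative.

# Option A: naive full overwrite -- replaces the data *and* the schema on every run.
(df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("bronze.customers"))

# Option B: append-style loads with schema evolution, so new source columns are
# added to the Delta table instead of failing the write.
(df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("bronze.customers"))

# For MERGE-based upserts (point 2), schema evolution is enabled per session:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
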
One might imagine the following process:

  1. Iterating through the tables of the public schema via information_schema:
# jdbcUrl and connectionProperties are assumed to be defined earlier.
# Read the catalog view over JDBC and collect the table names of the public schema.
table_names = (spark.read.jdbc(url=jdbcUrl,
                               table="information_schema.tables",
                               properties=connectionProperties)
                    .filter("table_schema = 'public'")
                    .select("table_name")
                    .rdd.flatMap(lambda x: x)
                    .collect())

for table in table_names:
    ...
  2. Then for each table:
  • (A) creating a new Delta Lake table if it doesn't exist (or possibly recreating it if its schema is out of date), otherwise
  • (B) merging new/updated data into the existing Delta Lake table (a sketch of this create-or-merge step follows the list).
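
Putting (A) and (B) together, a minimal sketch of the per-table loop might look as follows. It assumes an `id` primary key on every table, an illustrative `bronze` target schema, and the `jdbcUrl`/`connectionProperties` from step 1; it is a sketch under those assumptions, not a hardened implementation.

from delta.tables import DeltaTable

# Allow MERGE to add new source columns to the target table (schema drift).
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

for table in table_names:
    # Full read of the source table; for large tables an incremental predicate
    # (e.g. an updated_at watermark) would be pushed down here instead.
    src_df = spark.read.jdbc(url=jdbcUrl,
                             table=f"public.{table}",
                             properties=connectionProperties)

    target = f"bronze.{table}"  # illustrative target schema/table naming

    if not spark.catalog.tableExists(target):
        # (A) first load: create the Delta table with the source schema as-is
        src_df.write.format("delta").saveAsTable(target)
    else:
        # (B) subsequent loads: upsert on the assumed primary key column "id"
        (DeltaTable.forName(spark, target).alias("t")
            .merge(src_df.alias("s"), "t.id = s.id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())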

Third-party vendors like Airbyte offer this as a service -- presumably not because it is genuinely hard to implement, but more likely because the documentation and reference implementations of this common pattern on the Databricks DLT/Delta Lake side are lackluster.

A satisfying answer would provide (I) some background on, or verification of, the (possibly wrong) assumptions in the OP, (II) the missing code for this workflow, and (III) answers/clarification to the three points raised above.
