
Is it possible to use Delta Live Tables to perform incremental batch processing?

Now, I believe that this code will always load all of the data available in the directory when the pipeline runs:

CREATE LIVE TABLE lendingclub_raw
COMMENT "The raw loan risk dataset, ingested from /databricks-datasets."
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM parquet.`/databricks-datasets/samples/lending_club/parquet/`

But, if we do this instead:

CREATE LIVE TABLE lendingclub_raw
COMMENT "The raw loan risk dataset, ingested from /databricks-datasets."
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM cloud_files("/databricks-datasets/samples/lending_club/parquet/", "parquet")

will it only load the incremental data each time it runs, if the pipeline is run in triggered mode?

I know that you can achieve batch incremental processing in Auto Loader by using the trigger mode .trigger(once=True) or .trigger(availableNow=True) and running the pipeline on a schedule.

Since you cannot exactly define a trigger in DLT, how will this work?

Minura Punchihewa

1 Answer


You need to define your table as a streaming live table, so it will process only the data that has arrived since the last invocation. From the docs:

A streaming live table or view processes data that has been added only since the last pipeline update.

And then it can be combined with triggered execution, which behaves similarly to Trigger.AvailableNow. From the docs:

Triggered pipelines update each table with whatever data is currently available and then stop the cluster running the pipeline. Delta Live Tables automatically analyzes the dependencies between your tables and starts by computing those that read from external sources. Tables within the pipeline are updated after their dependent data sources have been updated.
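Putting the two together, the question's second snippet would become something like the following sketch. The `STREAMING` keyword makes the table incremental, and `cloud_files()` gives Auto Loader-style file discovery; running the pipeline in triggered mode on a schedule then mimics `Trigger.AvailableNow`:

```sql
-- Streaming live table: each triggered pipeline run ingests only
-- files that have arrived since the previous update.
CREATE STREAMING LIVE TABLE lendingclub_raw
COMMENT "The raw loan risk dataset, ingested incrementally from /databricks-datasets."
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM cloud_files("/databricks-datasets/samples/lending_club/parquet/", "parquet")
```

Note that `cloud_files()` is only valid in a streaming table definition, which is why the non-streaming `CREATE LIVE TABLE` variant in the question would not work as written.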

Alex Ott
    thank you. So, if we don't define the tables as `STREAMING LIVE`, i.e. set them as 'Complete', all of the data available will be reprocessed each time? Is this kind of like the output modes that are available in Structured Streaming (append, complete) or is it a different idea? – Minura Punchihewa Jul 19 '22 at 09:56
  • I do not see `streaming live` option in the current documentation. Is there an alternative right now? – Vali Rosca Mar 31 '23 at 14:20
  • it looks like syntax has changed to simple "streaming table": https://docs.databricks.com/delta-live-tables/sql-ref.html#create-a-delta-live-tables-materialized-view-or-streaming-table, but old syntax should be still supported: https://github.com/databricks/delta-live-tables-notebooks/blob/main/sql/Retail%20Sales.sql#L9 – Alex Ott Mar 31 '23 at 17:32