I am using Databricks and I am enjoying the Autoloader feature. Basically, it sets up the infrastructure to consume data in a micro-batch fashion. It works nicely for the initial raw table (or call it bronze).
However, I am a bit lost on how to append to my other tables - staging (or call it silver). The most complicated part is the step from staging (silver) to the DW layer (gold). Using the MERGE command is one way to do it, but at scale its performance may degrade.
I am looking for best practices to accommodate stream (micro-batch) and batch loads for my fact tables.
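To make the silver-to-gold concern concrete, this is roughly the foreachBatch + MERGE pattern I have been sketching. gold_fact, silver_staging, business_key and gold_checkpoint_path are placeholder names, not my real schema:

from delta.tables import DeltaTable

def upsert_to_gold(microbatch_df, batch_id):
    # gold_fact and business_key are placeholders for my fact table and its key
    gold = DeltaTable.forName(spark, "gold_fact")
    (gold.alias("t")
        .merge(microbatch_df.alias("s"),
               "t.business_key = s.business_key AND t.p_date = s.p_date")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

gold_query = (spark.readStream
    .table("silver_staging")
    .writeStream
    .foreachBatch(upsert_to_gold)
    .trigger(once=True)
    .option("checkpointLocation", gold_checkpoint_path)
    .start()
)
gold_query.awaitTermination()

My worry is that, as the fact table grows, the MERGE rewrites more and more files on every run.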
Just for reference, I will add my cloudFiles config:
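The cloudfile dict itself is just the usual Auto Loader options; the real values are not important here, but it looks roughly like this (the format and schema_path are placeholders):

cloudfile = {
    "cloudFiles.format": "json",                # placeholder: my real source format differs
    "cloudFiles.schemaLocation": schema_path,   # placeholder path variable
    "cloudFiles.inferColumnTypes": "true",
}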
raw_df = (spark
    .readStream.format("cloudFiles")
    .options(**cloudfile)
    .load(raw_path)
)
I write with the trigger once option, because I want to schedule the job with ADF:
autoloader_query = (raw_df.writeStream
    .format("delta")
    .trigger(once=True)
    .option("checkpointLocation", checkpoint_path)
    .partitionBy("p_date", "p_hour")
    .table("raw_table")
)

# Wait for the Autoloader query to finish
autoloader_query.awaitTermination()

# Show progress info from the Autoloader run
autoloader_query.recentProgress
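And going back to the silver layer: this is roughly how I have been appending from raw_table into a staging table. silver_table and silver_checkpoint_path are placeholders, and the actual cleaning/casting logic is omitted. I am not sure this is the right pattern:

silver_query = (spark.readStream
    .table("raw_table")
    # (light cleaning / casting would happen here before writeStream)
    .writeStream
    .format("delta")
    .outputMode("append")
    .trigger(once=True)
    .option("checkpointLocation", silver_checkpoint_path)
    .table("silver_table")
)
silver_query.awaitTermination()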
Again, I am looking for best practices for going from stream (micro-batch) to batch. Thank you!