
I am using Databricks and I am enjoying the Auto Loader feature. Basically, it sets up the infrastructure to consume data in a micro-batch fashion. It works nicely for the initial raw table (call it bronze).

But I am a bit lost on how to append to my other tables - staging (call it silver). The most complicated part is the step from staging (silver) to the DW layer (gold). Using the MERGE command is one way to do it, but at scale its performance may degrade.
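
To make the MERGE approach concrete, here is roughly what I mean for the silver-to-gold upsert, done per micro-batch with foreachBatch. Table names (silver_events, gold_fact), the join key event_id, and the checkpoint path are placeholders, not my real schema:

from delta.tables import DeltaTable

def upsert_to_gold(micro_batch_df, batch_id):
    # Upsert one micro-batch from silver into the gold fact table
    gold = DeltaTable.forName(spark, "gold_fact")
    (gold.alias("t")
         .merge(micro_batch_df.alias("s"), "t.event_id = s.event_id")
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

merge_query = (spark.readStream
               .table("silver_events")
               .writeStream
               .foreachBatch(upsert_to_gold)
               .option("checkpointLocation", gold_checkpoint_path)
               .trigger(once=True)
               .start())

merge_query.awaitTermination()

This works, but the MERGE has to rewrite files on every run, which is what worries me at scale.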

I am looking for best practices to accommodate both streaming (micro-batch) and batch loads for my fact tables.

Just for reference, here is my cloudFiles config:

raw_df = (spark
          .readStream.format("cloudFiles")
          .options(**cloudfile)
          .load(raw_path)
         )

I write with the trigger-once option (I want to schedule the job with ADF):

autoloader_query = (raw_df.writeStream
                 .format("delta")
                 .trigger(once=True)
                 .option("checkpointLocation",checkpoint_path)
                 .partitionBy("p_date","p_hour")
                 .table("raw_table")
                )

# Wait for the Auto Loader query to finish
autoloader_query.awaitTermination()

# Show the recent progress of the Auto Loader job
autoloader_query.recentProgress
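
For the bronze to silver step, what I have in mind is a plain streaming append from the raw table, roughly like the sketch below (the silver table name, the transformations, and the checkpoint path are placeholders):

# Read the bronze table as a stream and append cleaned-up rows to silver
silver_query = (spark.readStream
                .table("raw_table")
                # data cleanup / enrichment transformations would go here
                .writeStream
                .format("delta")
                .trigger(once=True)
                .option("checkpointLocation", silver_checkpoint_path)
                .toTable("silver_table")
               )

silver_query.awaitTermination()

Appending to silver like this seems fine; it is the fact tables downstream where a plain append does not work for me.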

I am looking for best practices for going from streaming to batch. Thank you!

  • It's not completely clear what kind of processing you want to perform for Silver & Gold... Usually Silver is a 1:1 match to Bronze, with data cleanup & enrichment applied – Alex Ott Oct 19 '21 at 06:00
  • Hi @AlexOtt, it is similar to a question I've asked over email. Step 1 is bronze to many silver tables, which should probably work with the `read_stream` command. Step 2: each silver table feeds 1 or more fact tables, and I can't do a plain append; MERGE is slow. Looking for the best practices. We can schedule some time if you like. – Dmitry Anoshin Oct 19 '21 at 17:39

0 Answers