I have a DAG that runs every hour and does incremental loads like:
select * from <table_name>
where last_mod_dt between <execution_date> AND <next_execution_date>;
-- execution_date here is the current DAG run's execution date (1 PM, 2 PM, ...)
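To make the setup concrete, here is a minimal sketch of how that hourly window gets rendered (table and column names are placeholders; in a real Airflow task the two dates would come from Jinja macros such as `{{ execution_date }}` and `{{ next_execution_date }}` rather than plain string formatting):

```python
from datetime import datetime, timedelta

# Hypothetical template mirroring the query above; in Airflow the
# execution_date / next_execution_date values are supplied by Jinja macros.
QUERY_TEMPLATE = (
    "select * from my_table "
    "where last_mod_dt between '{execution_date}' and '{next_execution_date}'"
)

def render_query(execution_date: datetime) -> str:
    """Render the incremental-load query for one hourly DAG run."""
    next_execution_date = execution_date + timedelta(hours=1)
    return QUERY_TEMPLATE.format(
        execution_date=execution_date.isoformat(sep=" "),
        next_execution_date=next_execution_date.isoformat(sep=" "),
    )

# The 1 PM run loads the 1 PM - 2 PM delta.
print(render_query(datetime(2023, 1, 1, 13)))
```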
However, some hourly DAG instances fail in between, so I want to ensure that each task instance always picks up from the last successful instance, so that no delta records are missed from the source.
For example, the current scenario:
DAG_Ins1 - 1 PM - Success (last_mod_dt between 1 PM AND 2 PM)
DAG_Ins2 - 2 PM - Fail (last_mod_dt between 2 PM AND 3 PM)
DAG_Ins3 - 3 PM - Fail (last_mod_dt between 3 PM AND 4 PM)
DAG_Ins4 - 4 PM - Success (last_mod_dt between 4 PM AND 5 PM)
The 4th instance, which succeeded, only picks its own execution date, irrespective of the status of the earlier runs.
Expected:
DAG_Ins1 - 1 PM - Success (last_mod_dt between 1 PM AND 2 PM)
DAG_Ins2 - 2 PM - Fail (last_mod_dt between 2 PM AND 3 PM)
DAG_Ins3 - 3 PM - Fail (last_mod_dt between 2 PM AND 4 PM)
DAG_Ins4 - 4 PM - Success (last_mod_dt between 2 PM AND 5 PM)
That is, the 4th instance should have picked its window start from the end of the last successful execution, which was 2 PM in the above example.
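The desired fallback logic can be sketched as: the window start for a run is the end of the most recent earlier run that succeeded, defaulting to the run's own start if none did. The function and data structure below are illustrative only, not Airflow's API:

```python
from datetime import datetime, timedelta

def window_start(prior_runs, current_start):
    """Return the load-window start for the run beginning at `current_start`.

    `prior_runs` is a list of (start_datetime, succeeded) pairs for earlier
    hourly runs. The window starts at the end (start + 1 hour) of the most
    recent successful run; if no earlier run succeeded, fall back to
    `current_start` itself.
    """
    success_ends = [
        start + timedelta(hours=1)
        for start, succeeded in prior_runs
        if succeeded and start < current_start
    ]
    return max(success_ends) if success_ends else current_start

runs = [
    (datetime(2023, 1, 1, 13), True),   # 1 PM - Success
    (datetime(2023, 1, 1, 14), False),  # 2 PM - Fail
    (datetime(2023, 1, 1, 15), False),  # 3 PM - Fail
]
# The 4 PM run should load from 2 PM (end of the 1 PM run) to 5 PM.
print(window_start(runs, datetime(2023, 1, 1, 16)))  # 2023-01-01 14:00:00
```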
I don't want to query the Airflow metadata tables directly. Does the method get_latest_execution_date() return the last successful/failed date-time of a DAG? Is there any other {{ macro }} that would do the needful?
Thanks!!