I'm learning AWS Glue. In traditional ETL, a common pattern is to look up the primary key in the destination table to decide whether a row needs an update or an insert (the upsert design pattern). Glue doesn't seem to offer that same control: simply writing out the dynamic frame is just an insert. I can think of two design patterns to solve this:
- Load the destination as a DataFrame and, in Spark, use a left outer join to insert only the new rows. (How would you update rows if you needed to? Delete, then insert? Since I'm new to Spark, this approach is the most foreign to me.)
- Load the data into a stage table and then use SQL to perform the final merge
It's this second method that I'm exploring first. In the AWS world, how can I execute a SQL script or stored procedure once the AWS Glue job is complete? Would you use a Python shell job, a Lambda function, a step inside the Glue job itself, or some other way?
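
To make the second pattern concrete, here is a minimal sketch of the merge step, under stated assumptions. The table names `customer` and `customer_stage`, the columns `id` and `name`, and the helpers `merge_stage` and `demo` are all hypothetical, and the standard-library `sqlite3` module stands in for the real destination database so the example runs anywhere. It uses a portable update-then-insert sequence rather than any vendor-specific `MERGE` syntax.

```python
import sqlite3

# Hypothetical table and column names. sqlite3 is only a stand-in for the
# real destination database so that this sketch is runnable as-is.
MERGE_STEPS = [
    # 1. Update destination rows whose key already exists in the stage table.
    """
    UPDATE customer
    SET name = (SELECT s.name FROM customer_stage AS s WHERE s.id = customer.id)
    WHERE EXISTS (SELECT 1 FROM customer_stage AS s WHERE s.id = customer.id)
    """,
    # 2. Insert stage rows whose key is not yet in the destination.
    """
    INSERT INTO customer (id, name)
    SELECT s.id, s.name
    FROM customer_stage AS s
    WHERE NOT EXISTS (SELECT 1 FROM customer AS c WHERE c.id = s.id)
    """,
    # 3. Empty the stage table so the next run starts clean.
    "DELETE FROM customer_stage",
]


def merge_stage(conn):
    """Run all merge steps in a single transaction on a DB-API connection."""
    cur = conn.cursor()
    try:
        for statement in MERGE_STEPS:
            cur.execute(statement)
        conn.commit()
    except Exception:
        # Leave the destination untouched if any step fails.
        conn.rollback()
        raise
    finally:
        cur.close()


def demo():
    """Build a tiny destination and stage table, merge, and return the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE customer_stage (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'old name')")
    conn.executemany(
        "INSERT INTO customer_stage VALUES (?, ?)",
        [(1, "new name"), (2, "brand new")],
    )
    conn.commit()
    merge_stage(conn)
    return conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()


if __name__ == "__main__":
    print(demo())
```

In a Glue setting, something like `merge_stage` could presumably be called at the end of the job script once the stage table has been written, or from a separate Python shell job or Lambda function triggered after the job. That assumes a DB-API driver for the target database is available in that environment, and the SQL would need adjusting to the target database's dialect.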