aws glue rds incremental load

Question

I am trying to load the data from the AWS RDS (MySQL) to the redshift using AWS glue. And I want to load the data incrementally. By using Job Bookmarks, glue can track only the newly added data but cant track the updated rows. Is there any way to load only the updated data? may be by using the field updated_at in the source table from MySQL?

score 2 · Answer 1 · answered Jun 07 '19 at 17:09

This is not possible using job bookmarks. From AWS documentation:

Job bookmarks are implemented for a limited use case for a relational database (JDBC connection) input source. For this input source, job bookmarks are supported only if the table's primary keys are in sequential order. Also, job bookmarks search for new rows, but not updated rows. This is because bookmarks look for the primary keys, which already exist.

https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

Glue will need to load the entirety of the RDS data into a Dynamic Frame or DataFrame. However, this data could be used to perform an upsert into the redshift database if what you're trying to avoid is truncating the redshift table and reloading all the data.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html

score 1 · Answer 2 · answered Jun 09 '19 at 03:35

1

You can use the query to find the updated records by filtering data at source JDBC database as shown below example. I have passed date as an argument so that for each run I can fetch only latest values from mysql database in this example.

Please refer to this answer for example

answered Jun 09 '19 at 03:35

Prabhakar Reddy

4,628
18
36

Where is a good place to save the last processed timestamp? An s3 file? – Guilherme Torres Castro Oct 08 '19 at 18:55

aws glue rds incremental load

2 Answers2