We are using an AWS Glue job to copy a table from Amazon RDS MySQL to Amazon S3.
We ran it once and it works well. How can we make sure it writes only the changes in the RDS table (changed or added rows) every day?
You can try AWS Glue job bookmarks, provided the incoming data is always new rows only.
AWS Glue PySpark dynamic frame methods include an optional parameter named transformation_ctx, which identifies the state information within a job bookmark for the given operator.
For example:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "database", table_name = "relatedqueries_csv", transformation_ctx = "datasource0")
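Note that bookmarks only take effect when the job is initialized and committed, and bookmarks must be enabled in the job properties. Below is a minimal end-to-end sketch of such a job; it reuses the catalog database and table names from the example above, and the S3 output path is a placeholder you would replace with your own:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job setup; JOB_NAME is supplied by Glue at run time
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# transformation_ctx ties this read to the job bookmark state,
# so later runs only pick up rows not seen in previous runs
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="database",
    table_name="relatedqueries_csv",
    transformation_ctx="datasource0")

# Write the incremental slice to S3 (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/my-table/"},
    format="parquet",
    transformation_ctx="datasink0")

# Persist the bookmark state for the next run
job.commit()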
Alternatively, you can fetch the whole table with Spark SQL and do a left outer join against the data you have already exported. A left outer join gives you the newly added as well as the modified records.
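A rough sketch of that join, assuming the table has a primary key column named id and a last-modified column named updated_at, and with the JDBC connection details and S3 paths as placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Full snapshot of the source table (connection details are placeholders)
source_df = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://my-rds-host:3306/mydb")
             .option("dbtable", "my_table")
             .option("user", "user")
             .option("password", "password")
             .load())

# Data already exported to S3 on previous runs (placeholder path)
existing_df = spark.read.parquet("s3://my-bucket/my-table/")

# Left outer join on the assumed primary key
joined = source_df.alias("src").join(
    existing_df.alias("old"),
    on=F.col("src.id") == F.col("old.id"),
    how="left_outer")

# New rows have no match in S3; modified rows have a newer updated_at
changed = joined.filter(
    F.col("old.id").isNull() |
    (F.col("src.updated_at") > F.col("old.updated_at"))
).select("src.*")

changed.write.mode("append").parquet("s3://my-bucket/my-table/")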
If you use bookmarks, the job will only fetch new records, and bookmarks over JDBC are of no use if the table's primary keys are not in sequential order. Also, job bookmarks detect new rows but not updated rows [1].
However, you can always push the query down to the database engine to compute the new and modified records. Please refer to the other answer for an example.
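As a hedged sketch of that pushdown approach, assuming the table has an updated_at column and that you track a watermark (the timestamp of the last successful run) somewhere yourself, the filtering then happens inside MySQL rather than in Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Watermark from the previous run; how you store it is up to you
last_run = "2024-01-01 00:00:00"
incremental_query = (
    f"SELECT * FROM my_table WHERE updated_at > '{last_run}'"
)

# The 'query' option pushes the filter down to the MySQL engine,
# so only changed/added rows are transferred
incremental_df = (spark.read.format("jdbc")
                  .option("url", "jdbc:mysql://my-rds-host:3306/mydb")
                  .option("query", incremental_query)
                  .option("user", "user")
                  .option("password", "password")
                  .load())

incremental_df.write.mode("append").parquet("s3://my-bucket/my-table/")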