How to look for updated rows when using AWS Glue?

Question

I'm trying to use Glue for ETL on data I'm moving from RDS to Redshift.

As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and does not track updated rows.

However that data I am working with tends to have rows updated frequently and I am looking for a possible solution. I'm a bit new to pyspark, so if it is possible to do this in pyspark I'd highly appreciate some guidance or a point in the right direction. If there's a possible solution outside of Spark, I'd love to hear it as well.

Unfortunately Glue doesn't support bookmarking for RDS data source. Currently it works with Amazon S3 and for some data formats only (see https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html). — Yuriy Bondaruk, Jan 26 '19 at 00:30

score 4 · Accepted Answer · answered Jan 26 '19 at 01:53

4

You can use the query to find the updated records by filtering data at source JDBC database as shown below example. I have passed date as an argument so that for each run I can fetch only latest values from mysql database in this example.

query= "(select ab.id,ab.name,ab.date1,bb.tStartDate from test.test12 ab join test.test34 bb on ab.id=bb.id where ab.date1>'" + args['start_date'] + "') as testresult"

datasource0 = spark.read.format("jdbc").option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test").option("driver", "com.mysql.jdbc.Driver").option("dbtable", query).option("user", "test").option("password", "Password1234").load()

answered Jan 26 '19 at 01:53

Prabhakar Reddy

4,628
18
36

Thanks for the reply, want to clarify a few things. Would this work with job bookmark on? Or will this have to be a separate job? What would be the best way to pass the update date parameter to the job? I was thinking storing it on a file on s3 and referring to it whenever the job ran. Also, would it be possible to pass datasource0 to the rest of the default code generated by aws glue, or would some modifications need to be done there as well? – Suresh Kasipandy Jan 26 '19 at 07:54
@SureshKasipandy You don't need bookmark to be on for above code to work.This query will be executed at your database and fetch you the only rows which are updated. datasource0 is a spark data frame and if you want to leverage Glue related transformations then you need to convert this dataframe to dynamicframe using fromDF() method. – Prabhakar Reddy Jan 28 '19 at 16:21

How to look for updated rows when using AWS Glue?

1 Answers1

Linked