We currently use AWS Glue (Python scripts) to migrate data from a MySQL database into an Amazon Redshift database. Yesterday we found an issue: some records in Redshift are duplicated, even though they share the same primary key that is used in the MySQL database. According to our requirements, the data in Redshift should exactly match the data in MySQL.
I tried to clear the Redshift table before each migration run, but couldn't find a way to do that from Glue...
Could you help me fix the issue? Here is the current Glue job script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read the source table from the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "glue-db", table_name = "table", transformation_ctx = "datasource0")
# Map source columns/types to the target schema
applymapping0_1 = ApplyMapping.apply(frame = datasource0, mappings = [...], transformation_ctx = "applymapping0_1")
# Resolve ambiguous column types by splitting them into separate columns
resolvechoice0_2 = ResolveChoice.apply(frame = applymapping0_1, choice = "make_cols", transformation_ctx = "resolvechoice0_2")
# Drop fields that contain only nulls
dropnullfields0_3 = DropNullFields.apply(frame = resolvechoice0_2, transformation_ctx = "dropnullfields0_3")
# Write to Redshift over JDBC, staging the data through the S3 temp dir
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "table", "database": "database"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
job.commit()
The root cause is that from_jdbc_conf only appends rows to the target table, so every run re-inserts records that are already there. My solution is to clear the table with a "preactions" statement before the write:
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "mytable", "database": "mydatabase", "preactions": "delete from public.mytable;"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
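If the table is large, a TRUNCATE is usually faster than a DELETE in Redshift and can be passed as a preaction the same way. Below is a minimal sketch, assuming the same (hypothetical) "mytable" and "redshift-cluster" names as above; note that TRUNCATE in Redshift commits immediately and cannot be rolled back:
# Variant: truncate instead of delete before the write.
# Assumes the same hypothetical table/connection names as above.
# TRUNCATE reclaims space and runs faster on large tables,
# but it commits immediately in Redshift.
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = dropnullfields0_3,
    catalog_connection = "redshift-cluster",
    connection_options = {
        "dbtable": "mytable",
        "database": "mydatabase",
        "preactions": "truncate table public.mytable;"
    },
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink0_4")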