
I am using PySpark on AWS and receive gigabytes of data every day that needs to be upserted: I want to look up each record's id in an existing table in the Glue database, update the row if the id already exists, and insert it if it does not.

Is it possible to do it in AWS glue?

Thanks!

Paras Pandey

2 Answers


Yes, you can use the AWS Glue PySpark extensions for this. The getSink call below writes the frame to S3 and, with enableUpdateCatalog=True and updateBehavior="UPDATE_IN_DATABASE", creates or updates the table definition and partitions in the Glue Data Catalog as part of the write:

# Write to S3 and update the Glue Data Catalog in the same job run
data_sink = glue_context.getSink(
    connection_type="s3",
    path="s3_path",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["partition_column"],
    compression="snappy",
    enableUpdateCatalog=True,
)
data_sink.setCatalogInfo(
    catalogDatabase=database_name,
    catalogTableName=table_name,
)
data_sink.setFormat("glueparquet")
data_sink.writeFrame(data_frame)  # data_frame must be a Glue DynamicFrame
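Note that the sink above keeps the catalog table up to date, but the row-level merge (replace existing ids, append new ones) still has to happen in your frame before writing. The merge semantics reduce to keying records by id; here is a minimal pure-Python sketch of that logic (field names and the upsert helper are illustrative, not part of the Glue API):

```python
def upsert(existing, incoming, key="id"):
    """Merge incoming records into existing ones: a row whose key already
    exists is replaced (update); a new key is appended (insert)."""
    merged = {row[key]: row for row in existing}  # index existing rows by id
    for row in incoming:
        merged[row[key]] = row  # overwrite on match, add otherwise
    return list(merged.values())

old = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
new = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(upsert(old, new))
# id 1 is kept, id 2 is updated, id 3 is inserted
```

In a real job you would express the same step in Spark, e.g. by unioning the new data with the existing table and de-duplicating on id before handing the result to writeFrame.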
Robert Kossendey

You can also run Athena queries from within the Glue job to implement the upsert logic. https://docs.aws.amazon.com/athena/latest/ug/querying-athena-tables.html
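As a sketch of this approach: assuming the target is an Apache Iceberg table (Athena's MERGE INTO only supports Iceberg, on engine version 3) and the new data has been staged as its own table, the upsert can be expressed as a single MERGE statement submitted via boto3. All table, column, and bucket names below are placeholders:

```python
def build_upsert_query(database, target, staging, key="id"):
    # Athena MERGE requires an Iceberg target table (engine v3).
    # "value" stands in for your real columns; list each column explicitly.
    return (
        f"MERGE INTO {database}.{target} AS t "
        f"USING {database}.{staging} AS s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET value = s.value "
        f"WHEN NOT MATCHED THEN INSERT ({key}, value) VALUES (s.{key}, s.value)"
    )

def run_upsert(database, target, staging, output_s3):
    import boto3  # imported here so the query builder stays dependency-free
    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=build_upsert_query(database, target, staging),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
```

If the table is plain Parquet rather than Iceberg, MERGE is not available and you would instead rewrite the affected partitions (e.g. via an INSERT INTO ... SELECT over the deduplicated union).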

Vikram Rawat