
I am trying to truncate a Postgres destination table prior to insert and, more generally, trying to fire external functions using the connections already created in Glue.

Has anyone been able to do so?

Josh Hamann
  • Possible duplicate of [Overwrite MySQL tables with AWS Glue](https://stackoverflow.com/questions/47556678/overwrite-mysql-tables-with-aws-glue) – hoaxz Dec 18 '17 at 20:15
  • Did you manage to do this? I also want to do something like this. – lilline Apr 25 '18 at 03:34

4 Answers


I've tried the DROP/TRUNCATE scenario, but have not been able to do it with the connections already created in Glue. It did work with a pure-Python PostgreSQL driver, pg8000:

  1. Download the pg8000 tarball from PyPI
  2. Create an empty __init__.py in the root folder
  3. Zip up the contents and upload to S3
  4. Reference the zip file in the Python lib path of the job
  5. Set the DB connection details as job parameters (make sure to prepend all key names with --; see the example below). Tick the "Server-side encryption" box.
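
For illustration, the job parameters from step 5 could look like the following; all values here are hypothetical, and the keys match the names read by getResolvedOptions in the script below:

# Hypothetical connection details as Glue job parameters (Default arguments).
# Keys must be prefixed with "--" for getResolvedOptions to pick them up.
default_arguments = {
    "--HOST": "mydb.cluster-abc123.us-east-1.rds.amazonaws.com",
    "--DB":   "analytics",
    "--USER": "glue_etl",
    "--PW":   "not-a-real-password",
}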

Then you can simply create a connection and execute SQL.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

import pg8000

# Read the connection details passed in as job parameters
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'PW',
    'HOST',
    'USER',
    'DB'
])

# Create the Spark & Glue contexts
sc = SparkContext()
glueContext = GlueContext(sc)

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... your ETL logic ...

# Open a plain pg8000 connection, independent of the Glue connections
config_port = 5432
conn = pg8000.connect(
    database=args['DB'],
    user=args['USER'],
    password=args['PW'],
    host=args['HOST'],
    port=config_port
)

# Placeholder identifiers; since they are interpolated into the SQL,
# they must come from trusted configuration, never from user input
schema = 'public'
table = 'my_table'

query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()
thenaturalist
  • Actually, I followed the exact steps you mentioned, but I'm still getting "No module named pg8000". Can you help me figure out where I went wrong? P.S. I got the lib from https://github.com/tlocke/pg8000 – Tharsan Sivakumar Apr 24 '19 at 03:06
  • I'm also getting the same error, "No module named pg8000". How did you overcome this? Is there anything else that needs to be added? – Mahi May 01 '19 at 19:51
  • While this is good, does it guarantee a transaction? Assume there is a DataFrame write statement after the connection: will the table still be truncated if there is an error while writing the DataFrame to the DB? @thenaturalist – user1401472 Jul 06 '20 at 10:07
  • Using the `pg8000` module with a Glue job is now simpler: go to 'Security configuration, script libraries, and job parameters (optional)', scroll down to 'Job Parameters', and enter `--additional-python-modules` for the key and `pg8000` for the value... and you're done :) – Brian Wylie Jul 15 '21 at 20:35
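
A minimal boto3 sketch of that newer route, for reference; the job name, role, and script location are all hypothetical:

import boto3

glue = boto3.client("glue")

# Hypothetical job definition; the relevant piece is DefaultArguments,
# which makes Glue pip-install pg8000 for the job at start-up.
glue.create_job(
    Name="truncate-postgres-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/truncate_job.py",
    },
    DefaultArguments={"--additional-python-modules": "pg8000"},
)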

After following step (4) of @thenaturalist's response,

sc.addPyFile("/home/glue/downloads/python/pg8000.zip")

import pg8000

worked for me on a development endpoint (Zeppelin notebook).

More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
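
For context, a sketch of how the zip ends up under /home/glue/downloads/python/ in the first place: attach it to the endpoint via ExtraPythonLibsS3Path. The endpoint name, role ARN, and S3 path below are hypothetical:

import boto3

glue = boto3.client("glue")

# Hypothetical endpoint definition; Glue downloads the libraries listed in
# ExtraPythonLibsS3Path onto the endpoint, where addPyFile can then load them.
glue.create_dev_endpoint(
    EndpointName="pg8000-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    ExtraPythonLibsS3Path="s3://my-bucket/python/pg8000.zip",
)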

Tejas

To clarify @thenaturalist's instructions for the zip, as I still struggled with this:


Download the tar.gz of pg8000 from pypi.org and extract it.
Zip the contents so you have the structure below:

pg8000-1.15.3.zip
|
| -- pg8000 <dir>
    | -- __init__.py
    | -- _version.py <optional>
    | -- core.py

Upload the zip to S3 and then you should be able to do a simple import pg8000.

NOTE: scramp is also required at the moment, so follow the same procedure as above to include the scramp module (you don't need to import it, though).
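
A minimal Python sketch of producing that layout, assuming the tarball was extracted into the current directory (the version number and paths are illustrative):

import os
import zipfile

SRC = "pg8000-1.15.3/pg8000"  # package dir inside the extracted tarball

with zipfile.ZipFile("pg8000-1.15.3.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk(SRC):
        for name in files:
            path = os.path.join(root, name)
            # Keep the top-level "pg8000/" folder in the archive so that
            # `import pg8000` resolves once Glue puts the zip on sys.path.
            zf.write(path, os.path.relpath(path, os.path.dirname(SRC)))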

John B

# `sql` holds the SELECT that produces the data to load
data = spark.sql(sql)

# Pull url/user/password out of the Glue connection
conf = glueContext.extract_jdbc_conf("jdbc-commerce")

data.write \
    .mode('overwrite') \
    .format("jdbc") \
    .option("url", conf['url']) \
    .option("database", 'Pacvue_Commerce') \
    .option("dbtable", "dbo.glue_1") \
    .option("user", conf['user']) \
    .option("password", conf['password']) \
    .option('truncate', 'true') \
    .save()

The Glue API doesn't support truncate, but the Spark API does: with mode('overwrite') and the truncate option set to 'true', Spark issues a TRUNCATE TABLE instead of dropping and recreating the table.

jdbc-commerce is the name of the Glue connection created with your crawler. Use extract_jdbc_conf to get the URL, username, and password from it.
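
For the PostgreSQL table in the question, the same pattern would look roughly like this (the connection name, DataFrame, and target table are hypothetical):

# Hypothetical Glue connection name; df is the DataFrame to be loaded
conf = glueContext.extract_jdbc_conf("my-postgres-connection")

(df.write
    .mode("overwrite")
    .format("jdbc")
    .option("url", conf["url"])
    .option("dbtable", "public.my_table")  # hypothetical target table
    .option("user", conf["user"])
    .option("password", conf["password"])
    .option("truncate", "true")            # truncate instead of drop + recreate
    .save())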

user2408678