
I am trying to truncate a Postgres destination table prior to insert and, more generally, trying to fire external functions using the connections already created in Glue.

Has anyone been able to do so?

Josh Hamann
  • Possible duplicate of [Overwrite MySQL tables with AWS Glue](https://stackoverflow.com/questions/47556678/overwrite-mysql-tables-with-aws-glue) – hoaxz Dec 18 '17 at 20:15
  • Did you manage to do this? I also want to do something like this. – lilline Apr 25 '18 at 03:34

4 Answers


I've tried the DROP/TRUNCATE scenario, but have not been able to do it with the connections already created in Glue. It did work with a pure-Python PostgreSQL driver, pg8000:

  1. Download the pg8000 tarball from PyPI
  2. Create an empty __init__.py in the root folder
  3. Zip up the contents and upload to S3
  4. Reference the zip file in the Python lib path of the job
  5. Set the DB connection details as job parameters (make sure to prepend all key names with --; see the example below). Tick the "Server-side encryption" box.
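
For illustration, the job parameters from step 5 could look like the following; all values here are hypothetical, and the keys match the names read by getResolvedOptions in the script below:

# Hypothetical connection details as Glue job parameters (Default arguments).
# Keys must be prefixed with "--" for getResolvedOptions to pick them up.
default_arguments = {
    "--HOST": "mydb.cluster-abc123.us-east-1.rds.amazonaws.com",
    "--DB":   "analytics",
    "--USER": "glue_etl",
    "--PW":   "not-a-real-password",
}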

Then you can simply create a connection and execute SQL.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

import pg8000

# Read the connection details passed in as job parameters
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'PW',
    'HOST',
    'USER',
    'DB'
])

# Create the Spark & Glue contexts
sc = SparkContext()
glueContext = GlueContext(sc)

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... your ETL logic ...

# Open a plain pg8000 connection, independent of the Glue connections
config_port = 5432
conn = pg8000.connect(
    database=args['DB'],
    user=args['USER'],
    password=args['PW'],
    host=args['HOST'],
    port=config_port
)

# Placeholder identifiers; since they are interpolated into the SQL,
# they must come from trusted configuration, never from user input
schema = 'public'
table = 'my_table'

query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()
thenaturalist
  • Actually, I followed the exact steps you mentioned, but I'm still getting "No module named pg8000". Can you help me figure out where I went wrong? P.S. I got the lib from https://github.com/tlocke/pg8000 – Tharsan Sivakumar Apr 24 '19 at 03:06
  • I'm also getting the same error, "No module named pg8000". How did you overcome this? Is there anything else that needs to be added? – Mahi May 01 '19 at 19:51
  • While this is good, does it guarantee a transaction? Assume there is a DataFrame write statement after the connection: will the table still be truncated if there is an error while writing the DataFrame to the DB? @thenaturalist – user1401472 Jul 06 '20 at 10:07
  • Using the `pg8000` module with a Glue job is now simpler: go to 'Security configuration, script libraries, and job parameters (optional)', scroll down to 'Job Parameters', and enter `--additional-python-modules` for the key and `pg8000` for the value... and you're done :) – Brian Wylie Jul 15 '21 at 20:35
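
A minimal boto3 sketch of that newer route, for reference; the job name, role, and script location are all hypothetical:

import boto3

glue = boto3.client("glue")

# Hypothetical job definition; the relevant piece is DefaultArguments,
# which makes Glue pip-install pg8000 for the job at start-up.
glue.create_job(
    Name="truncate-postgres-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/truncate_job.py",
    },
    DefaultArguments={"--additional-python-modules": "pg8000"},
)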

After following step (4) of @thenaturalist's response,

sc.addPyFile("/home/glue/downloads/python/pg8000.zip")

import pg8000

worked for me on a development endpoint (Zeppelin notebook).

More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
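
For context, a sketch of how the zip ends up under /home/glue/downloads/python/ in the first place: attach it to the endpoint via ExtraPythonLibsS3Path. The endpoint name, role ARN, and S3 path below are hypothetical:

import boto3

glue = boto3.client("glue")

# Hypothetical endpoint definition; Glue downloads the libraries listed in
# ExtraPythonLibsS3Path onto the endpoint, where addPyFile can then load them.
glue.create_dev_endpoint(
    EndpointName="pg8000-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    ExtraPythonLibsS3Path="s3://my-bucket/python/pg8000.zip",
)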

Tejas

To clarify @thenaturalist's instructions for the zip, as I still struggled with this:


Download the tar.gz of pg8000 from pypi.org and extract it.
Zip the contents so you have the structure below:

pg8000-1.15.3.zip
|
| -- pg8000 <dir>
    | -- __init__.py
    | -- _version.py <optional>
    | -- core.py

Upload the zip to S3 and then you should be able to do a simple import pg8000.

NOTE: scramp is also required at the moment, so follow the same procedure as above to include the scramp module (you don't need to import it, though).
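
A minimal Python sketch of producing that layout, assuming the tarball was extracted into the current directory (the version number and paths are illustrative):

import os
import zipfile

SRC = "pg8000-1.15.3/pg8000"  # package dir inside the extracted tarball

with zipfile.ZipFile("pg8000-1.15.3.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk(SRC):
        for name in files:
            path = os.path.join(root, name)
            # Keep the top-level "pg8000/" folder in the archive so that
            # `import pg8000` resolves once Glue puts the zip on sys.path.
            zf.write(path, os.path.relpath(path, os.path.dirname(SRC)))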

John B

# `sql` holds the SELECT that produces the data to load
data = spark.sql(sql)

# Pull url/user/password out of the Glue connection
conf = glueContext.extract_jdbc_conf("jdbc-commerce")

data.write \
    .mode('overwrite') \
    .format("jdbc") \
    .option("url", conf['url']) \
    .option("database", 'Pacvue_Commerce') \
    .option("dbtable", "dbo.glue_1") \
    .option("user", conf['user']) \
    .option("password", conf['password']) \
    .option('truncate', 'true') \
    .save()

The Glue API doesn't support truncate, but the Spark API does: with mode('overwrite') and the truncate option set to 'true', Spark issues a TRUNCATE TABLE instead of dropping and recreating the table.

jdbc-commerce is the name of the Glue connection created with your crawler. Use extract_jdbc_conf to get the URL, username, and password from it.
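
For the PostgreSQL table in the question, the same pattern would look roughly like this (the connection name, DataFrame, and target table are hypothetical):

# Hypothetical Glue connection name; df is the DataFrame to be loaded
conf = glueContext.extract_jdbc_conf("my-postgres-connection")

(df.write
    .mode("overwrite")
    .format("jdbc")
    .option("url", conf["url"])
    .option("dbtable", "public.my_table")  # hypothetical target table
    .option("user", conf["user"])
    .option("password", conf["password"])
    .option("truncate", "true")            # truncate instead of drop + recreate
    .save())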

user2408678