
I am looking for a batch loader for a Glue job to load into RDS using a PySpark script with the DataFrameWriter. I have this working for Redshift as follows:

df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbcconf.get("url") + '/' + DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()

Here df is defined above to read in a file. What is the best approach I could take to do this in RDS instead of Redshift?


2 Answers


If in RDS you only need to APPEND / OVERWRITE, you can create an RDS JDBC connection and use something like the below:

postgres_url = "jdbc:postgresql://localhost:portnum/sakila?user=<user>&password=<pwd>"
df.write.jdbc(postgres_url, table="actor1", mode="append")     # for append
df.write.jdbc(postgres_url, table="actor1", mode="overwrite")  # for overwrite
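If extra JDBC settings are needed (driver class, batch size, and so on), df.write.jdbc also accepts a properties dict. A minimal sketch with placeholder values; the endpoint, credentials and batch size below are illustrative only:

props = {
    "user": "<user>",
    "password": "<pwd>",
    "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver class
    "batchsize": "10000",               # illustrative write batch size
}
df.write.jdbc("jdbc:postgresql://<rds-endpoint>:5432/sakila",
              table="actor1", mode="append", properties=props)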

If it involves UPSERTs, you can probably use a MySQL client library as an external Python library and perform an INSERT INTO ... ON DUPLICATE KEY UPDATE.
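For instance, a minimal upsert sketch using the PyMySQL package shipped as an external Python library; the actor1 table, its columns and the connection details are hypothetical:

import pymysql

# collect() pulls the rows to the driver, so this suits modest volumes;
# for larger data the same logic can run inside df.foreachPartition.
rows = [tuple(r) for r in df.select("actor_id", "first_name", "last_name").collect()]

conn = pymysql.connect(host="<rds-endpoint>", user="<user>",
                       password="<pwd>", database="sakila")
try:
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO actor1 (actor_id, first_name, last_name) "
            "VALUES (%s, %s, %s) "
            "ON DUPLICATE KEY UPDATE first_name = VALUES(first_name), "
            "last_name = VALUES(last_name)",
            rows)
    conn.commit()
finally:
    conn.close()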

Please refer to this question: How to use JDBC source to write and read data in (Py)Spark?

Regards,

Yuva


I learned that this can only be done through JDBC. For example:

df.write.format("jdbc") \
    .option("url", jdbcconf.get("url") + '/' + REDSHIFT_DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", REDSHIFT_TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()
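Depending on the RDS engine, the driver class may also need to be named explicitly. A sketch for a MySQL-backed RDS instance; the endpoint and database below are placeholders, and the driver jar must be available to the Glue job (for example via the Glue connection or the --extra-jars job parameter):

df.write.format("jdbc") \
    .option("url", "jdbc:mysql://<rds-endpoint>:3306/<database>") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", TABLE_NAME) \
    .option("user", jdbcconf.get("user")) \
    .option("password", jdbcconf.get("password")) \
    .mode("overwrite") \
    .save()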