
I am looking for a batch loader for a Glue job to load into RDS using a PySpark script with the DataFrameWriter. I have this working for Redshift as follows:

df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbcconf.get("url") + '/' + DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()

Here df is defined above to read in a file. What is the best approach I could take to do this in RDS instead of Redshift?


2 Answers


If in RDS you only need to APPEND / OVERWRITE, you can create an RDS JDBC connection and use something like the below:

postgres_url = "jdbc:postgresql://localhost:portnum/sakila?user=<user>&password=<pwd>"
df.write.jdbc(postgres_url, table="actor1", mode="append")     # for append
df.write.jdbc(postgres_url, table="actor1", mode="overwrite")  # for overwrite
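If extra JDBC settings are needed (driver class, batch size, and so on), df.write.jdbc also accepts a properties dict. A minimal sketch with placeholder values; the endpoint, credentials and batch size below are illustrative only:

props = {
    "user": "<user>",
    "password": "<pwd>",
    "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver class
    "batchsize": "10000",               # illustrative write batch size
}
df.write.jdbc("jdbc:postgresql://<rds-endpoint>:5432/sakila",
              table="actor1", mode="append", properties=props)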

If it involves UPSERTs, you can probably use a MySQL client library as an external Python library and perform an INSERT INTO ... ON DUPLICATE KEY UPDATE.
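For instance, a minimal upsert sketch using the PyMySQL package shipped as an external Python library; the actor1 table, its columns and the connection details are hypothetical:

import pymysql

# collect() pulls the rows to the driver, so this suits modest volumes;
# for larger data the same logic can run inside df.foreachPartition.
rows = [tuple(r) for r in df.select("actor_id", "first_name", "last_name").collect()]

conn = pymysql.connect(host="<rds-endpoint>", user="<user>",
                       password="<pwd>", database="sakila")
try:
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO actor1 (actor_id, first_name, last_name) "
            "VALUES (%s, %s, %s) "
            "ON DUPLICATE KEY UPDATE first_name = VALUES(first_name), "
            "last_name = VALUES(last_name)",
            rows)
    conn.commit()
finally:
    conn.close()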

Please refer to this question: How to use JDBC source to write and read data in (Py)Spark?

Regards,

Yuva


I learned that this can only be done through JDBC. For example:

df.write.format("jdbc") \
    .option("url", jdbcconf.get("url") + '/' + REDSHIFT_DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", REDSHIFT_TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()
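Depending on the RDS engine, the driver class may also need to be named explicitly. A sketch for a MySQL-backed RDS instance; the endpoint and database below are placeholders, and the driver jar must be available to the Glue job (for example via the Glue connection or the --extra-jars job parameter):

df.write.format("jdbc") \
    .option("url", "jdbc:mysql://<rds-endpoint>:3306/<database>") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", TABLE_NAME) \
    .option("user", jdbcconf.get("user")) \
    .option("password", jdbcconf.get("password")) \
    .mode("overwrite") \
    .save()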