I have a JavaRDD that I need to persist to some external DB.

What would be the best way to do it so I don't suffocate my DB with an enormous number of connections? That is, I would like to have control over the number of connection pools created in my Spark app.

I believe that rdd.foreach would be a bad option, as it might end up opening a connection for each row. I assume rdd.foreachPartition is probably better, but I'm not quite sure.
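
A minimal sketch of the `foreachPartition` pattern the question is leaning toward. The `DbClient` interface and `connect()` method here are hypothetical stand-ins for whatever client the actual database provides:

```java
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;

public class PartitionWriter {

    // Hypothetical stand-in for the real database client/driver.
    interface DbClient extends AutoCloseable {
        void write(String record);
    }

    static DbClient connect() {
        throw new UnsupportedOperationException("replace with the real driver call");
    }

    static void save(JavaRDD<String> rdd, int maxConnections) {
        rdd.repartition(maxConnections)              // caps concurrent connections
           .foreachPartition((Iterator<String> rows) -> {
               try (DbClient db = connect()) {       // one connection per partition
                   while (rows.hasNext()) {
                       db.write(rows.next());        // reused for every row
                   }
               }
           });
    }
}
```

The `repartition` call bounds the number of simultaneous connections at `maxConnections`, at the cost of a shuffle; without it, you get one connection per existing partition.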

  • Perhaps you can convert it to a dataframe df, `df.repartition(x)` with x = the number of connections your DB can tolerate, and then save using the Spark JDBC data source: `df.write.format("jdbc").option(...).save()` (see the sketch after these comments). https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html `foreachPartition()` would also work but... why? – mazaneicha Apr 30 '20 at 23:13
  • Thanks but unfortunately I can't use JDBC with my database... – Vitali Melamud May 01 '20 at 06:19
  • rdd.foreachPartition is better than rdd.foreach – tarun May 01 '20 at 06:51
  • _I can't use JDBC_ - then it's probably not about best practice. Will this post help? https://stackoverflow.com/questions/30484701/apache-spark-foreach-vs-foreachpartitions-when-to-use-what – mazaneicha May 01 '20 at 12:15
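
For completeness, a minimal sketch of the DataFrame/JDBC route from the first comment; the URL, table name, and credentials are placeholders, and it only applies when the target DB has a JDBC driver (which, per the asker, this one does not):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class JdbcWriter {
    static void save(SparkSession spark, JavaRDD<String> rdd, int maxConnections) {
        // RDD -> Dataset with one string column named "value"
        Dataset<String> df = spark.createDataset(rdd.rdd(), Encoders.STRING());

        df.repartition(maxConnections)   // x = connections the DB can tolerate
          .write()
          .format("jdbc")
          .option("url", "jdbc:postgresql://host:5432/db") // placeholder URL
          .option("dbtable", "my_table")                   // placeholder table
          .option("user", "user")                          // placeholder creds
          .option("password", "secret")
          .save();
    }
}
```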

0 Answers