I have a JavaRDD that I need to persist to some external DB.

What would be the best way to do it so I don't suffocate my DB with an enormous number of connections? That is, I would like to have control over the number of connection pools created in my Spark app.

I believe that rdd.foreach would be a bad option, as it might end up opening a connection for each row. I assume rdd.foreachPartition is probably better, but I'm not quite sure.
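
A minimal sketch of the `foreachPartition` pattern the question is leaning toward. The `DbClient` interface and `connect()` method here are hypothetical stand-ins for whatever client the actual database provides:

```java
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;

public class PartitionWriter {

    // Hypothetical stand-in for the real database client/driver.
    interface DbClient extends AutoCloseable {
        void write(String record);
    }

    static DbClient connect() {
        throw new UnsupportedOperationException("replace with the real driver call");
    }

    static void save(JavaRDD<String> rdd, int maxConnections) {
        rdd.repartition(maxConnections)              // caps concurrent connections
           .foreachPartition((Iterator<String> rows) -> {
               try (DbClient db = connect()) {       // one connection per partition
                   while (rows.hasNext()) {
                       db.write(rows.next());        // reused for every row
                   }
               }
           });
    }
}
```

The `repartition` call bounds the number of simultaneous connections at `maxConnections`, at the cost of a shuffle; without it, you get one connection per existing partition.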

  • Perhaps you can convert it to a dataframe df, `df.repartition(x)` with x = the number of connections your DB can tolerate, and then save using the Spark JDBC data source: `df.write.format("jdbc").option(...).save()` (see the sketch after these comments). https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html `foreachPartition()` would also work but... why? – mazaneicha Apr 30 '20 at 23:13
  • Thanks but unfortunately I can't use JDBC with my database... – Vitali Melamud May 01 '20 at 06:19
  • rdd.foreachPartition is better than rdd.foreach – tarun May 01 '20 at 06:51
  • _I can't use JDBC_ - then it's probably not about best practice. Will this post help? https://stackoverflow.com/questions/30484701/apache-spark-foreach-vs-foreachpartitions-when-to-use-what – mazaneicha May 01 '20 at 12:15
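
For completeness, a minimal sketch of the DataFrame/JDBC route from the first comment; the URL, table name, and credentials are placeholders, and it only applies when the target DB has a JDBC driver (which, per the asker, this one does not):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class JdbcWriter {
    static void save(SparkSession spark, JavaRDD<String> rdd, int maxConnections) {
        // RDD -> Dataset with one string column named "value"
        Dataset<String> df = spark.createDataset(rdd.rdd(), Encoders.STRING());

        df.repartition(maxConnections)   // x = connections the DB can tolerate
          .write()
          .format("jdbc")
          .option("url", "jdbc:postgresql://host:5432/db") // placeholder URL
          .option("dbtable", "my_table")                   // placeholder table
          .option("user", "user")                          // placeholder creds
          .option("password", "secret")
          .save();
    }
}
```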

0 Answers