
I'm writing a large dataframe to a MySQL database (Aurora on AWS RDS).

I'm doing roughly the following (pseudocode):

rdd1 = sc.textFile("/some/dir")
rdd2 = rdd1.map(addSchema)
df = sqlContext.createDataFrame(rdd2)
df.write.jdbc(url="...", table="mydb.table", mode="append")

The dataframe has roughly 650,000 rows, and the write sometimes (yes, only sometimes) dies mid-insert, or at least that's what appears to be happening.

In stderr, there is a line somewhere toward the bottom saying the application exited with status 1 (error), but there is no other error message anywhere aside from that final bit.

Is df.write.jdbc known to be an unreliable way to write large sets of data to a MySQL database? How can I save my large dataframe to MySQL without the job dying so frequently?
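
For what it's worth, this is the kind of tuning I was planning to try next: fewer write partitions to limit the number of concurrent connections to Aurora, a larger JDBC batch size, and MySQL's rewriteBatchedStatements connection property. The partition count and batch size below are guesses I haven't validated, and I'm assuming the extra keys in properties get merged into the JDBC options/connection properties:

rdd1 = sc.textFile("/some/dir")
rdd2 = rdd1.map(addSchema)
df = sqlContext.createDataFrame(rdd2)

# fewer partitions -> fewer concurrent write tasks / DB connections (16 is a guess)
df = df.repartition(16)

# "batchsize" controls rows per JDBC batch (default 1000);
# rewriteBatchedStatements lets the MySQL driver collapse batched INSERTs
df.write.jdbc(url="...", table="mydb.table", mode="append",
              properties={"batchsize": "10000",
                          "rewriteBatchedStatements": "true"})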

Edit: Spark 2.0, EMR 5.0

Kristian
  • AFAIK `df.write.jdbc(url=...)` is the right way; please refer to @zero323's [answer](http://stackoverflow.com/questions/30983982/how-to-use-jdbc-source-to-write-and-read-data-in-pyspark). Also, please find the reason for the failure / more details and paste them here. – Ram Ghadiyaram Oct 04 '16 at 04:05
  • Also mention the Spark version, the number of executors, and other info. – Ram Ghadiyaram Oct 04 '16 at 04:07
  • 650K rows shouldn't be too bad unless, of course, something else is causing you problems (e.g. network stability between MySQL and your EMR cluster, MySQL database load, or table structure). Would you be able to provide more information about these? Also, see if what's suggested in this other thread helps: http://stackoverflow.com/questions/36912442/low-jdbc-write-speed-from-spark-to-mysql – Junjun Olympia Oct 05 '16 at 00:49

0 Answers