
I need to write about 1 million rows from a Spark DataFrame to MySQL, but the insert is too slow. How can I improve it?

Code below:

```python
df = sqlContext.createDataFrame(rdd, schema)
df.write.jdbc(url='xx', table='xx', mode='overwrite')
```
Takashi Lee
  • Spark [supports batch inserts](https://issues.apache.org/jira/browse/SPARK-10040) since 1.6.0, so if you are on an older version, upgrade. If you can't upgrade for some reason, get the RDD from your DataFrame and do the batch insert by hand in a foreachPartition loop (a sketch of that approach follows these comments). – Vitalii Kotliarenko Apr 28 '16 at 12:57
  • I'm using Spark 1.6.1, but the write speed is only about 100 records/s. Would repartitioning the DataFrame help here? – Takashi Lee Apr 28 '16 at 17:24
  • The default batch size is 1000. Maybe increasing it would speed things up? I'm giving that a try now! (You can set it through the `batchsize` property.) – Daniel Darabos Jul 18 '16 at 10:17
  • Where do you get `schema` from? @DanielDarabos – Gocht Dec 14 '16 at 20:47
  • A more typical example would be to load the DataFrame directly via `spark.read` or use `RDD.toDF`. – Daniel Darabos Dec 14 '16 at 23:02
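A minimal sketch of that hand-rolled foreachPartition batch insert, assuming a MySQL driver such as PyMySQL is installed on the executors; the host, credentials, table, and column names are placeholders:

```python
import pymysql

def insert_partition(rows):
    # One connection per partition; host/credentials are placeholders.
    conn = pymysql.connect(host='xx', user='xx', password='xx', database='xx')
    try:
        with conn.cursor() as cur:
            batch = []
            for row in rows:
                batch.append((row[0], row[1]))  # adapt to your schema
                if len(batch) >= 1000:
                    # Flush every 1000 rows as one batched statement.
                    cur.executemany(
                        "INSERT INTO xx (col1, col2) VALUES (%s, %s)", batch)
                    conn.commit()
                    batch = []
            if batch:  # flush the remainder
                cur.executemany(
                    "INSERT INTO xx (col1, col2) VALUES (%s, %s)", batch)
                conn.commit()
    finally:
        conn.close()

df.rdd.foreachPartition(insert_partition)
```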

1 Answer


The answer in https://stackoverflow.com/a/10617768/3318517 worked for me: add `rewriteBatchedStatements=true` to the connection URL. (See Configuration Properties for Connector/J.)

My benchmark went from 3325 seconds to 42 seconds!
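For reference, a sketch of the same write with the flag in the URL and a larger `batchsize` (the default is 1000, as noted in the comments); the host, database, and credentials below are placeholders:

```python
df.write.jdbc(
    url='jdbc:mysql://host:3306/db?rewriteBatchedStatements=true',
    table='xx',
    mode='overwrite',
    properties={
        'user': 'xx',
        'password': 'xx',
        'driver': 'com.mysql.jdbc.Driver',
        'batchsize': '10000',  # rows per JDBC batch; tune for your data
    },
)
```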

Daniel Darabos