
I'm writing a Python app that reads data from Cassandra, does some processing via the Spark API, and then writes the results to a MySQL database via JDBC.

The Spark processing itself takes only several seconds, but writing the final DataFrame (about 5000 rows) to MySQL takes around 10 minutes, so I'm trying to figure out how to speed up that part.

This is the code I'm using to write the PySpark DataFrame to a MySQL table:

df_name.write.jdbc(url=mysql_url.value,
                   table=tbl_name,
                   mode=mode.value,
                   properties={'user': mysql_user.value,
                               'password': mysql_pwd.value,
                               'driver': mysql_jdbc.value})

I allocated --executor-memory 4g --driver-memory 2g --total-executor-cores 6 to run my app.
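
For context, this is roughly how I submit the app (the script name is just a placeholder):

spark-submit \
    --executor-memory 4g \
    --driver-memory 2g \
    --total-executor-cores 6 \
    my_app.py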

The DataFrame I want to write to MySQL has 5740 partitions; I followed several pieces of advice to reduce the number of partitions (via the .repartition() function, as sketched below), but it doesn't speed up the writing phase.
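
This is roughly what I tried; the target count of 8 is just an example value I experimented with:

# Shrink the 5740 partitions before the JDBC write so that Spark opens
# far fewer connections/batches against MySQL. coalesce() avoids a full
# shuffle when only reducing the partition count.
df_small = df_name.coalesce(8)  # also tried df_name.repartition(8)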

I also followed the trick in [1] and set rewriteBatchedStatements=true and useServerPrepStmts=false, but there is no performance improvement so far.
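
Concretely, I appended those flags to the JDBC URL, roughly like this (host, port, and database name are placeholders):

# Connector/J flags from [1]: rewrite individual INSERTs into multi-row
# batches and disable server-side prepared statements.
url = ('jdbc:mysql://db-host:3306/mydb'
       '?rewriteBatchedStatements=true'
       '&useServerPrepStmts=false')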

Are there any settings or tricks I could try to improve the write speed from a DataFrame to MySQL via JDBC?

Thank you for your help.

[1] Low JDBC write speed from Spark to MySQL

P.S. Our MySQL database is in production, and we don't see performance issues in any other applications.

lordelf
  • If 5000 rows take 10 minutes, there is either something very wrong with your database configuration or (more likely) the problem is not really with the JDBC writer. How do you create `df_name`? – Alper t. Turker Apr 16 '18 at 21:46
  • I read data from Cassandra into a DataFrame, do some filtering, and then store the results in `df_name`; everything goes smoothly (taking only a few seconds), but the bottleneck occurs when calling `df_name.write.jdbc`, which takes about 10 minutes. Our MySQL database is in production, and we don't see performance issues in any other applications. – lordelf Apr 17 '18 at 12:39
  • _everything goes smoothly (taking only a few seconds)_ - because everything before you call an action is lazy. Nothing happens there. – Alper t. Turker Apr 17 '18 at 12:40
  • That's what I think too. – lordelf Apr 17 '18 at 12:45
  • Simplest way to check: replace `write.jdbc` with `df_name.rdd.foreach(lambda _: None)` (a sketch of this check follows these comments). – Alper t. Turker Apr 17 '18 at 12:47
  • I tried calling `collect()` and writing to the file system in CSV format; both take about 10 minutes too. – lordelf Apr 17 '18 at 13:15
  • So, like I said, the problem is not related to JDBC at all. I'd suggest you either edit this question to include a [mcve], or delete this one and ask a new one with all the required details. – Alper t. Turker Apr 17 '18 at 13:51
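
A minimal sketch of the check suggested in the comments above, assuming `df_name` is already built (the timing code is only for illustration):

import time

# Spark is lazy: the Cassandra read and the filtering only run once an
# action is triggered. This no-op action materializes df_name without
# touching MySQL, which isolates the upstream cost from the JDBC write.
start = time.time()
df_name.rdd.foreach(lambda _: None)
print('materialization took %.1f s' % (time.time() - start))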

0 Answers