I'd like to say in advance that several related questions, like the following, DO NOT address my problem:
- Spark query running very slow
- Converting mysql table to dataset is very slow...
- Spark Will Not Load Large MySql Table
- Spark MySQL Error while Reading from Database
This one comes close, but the stack-trace is different and it is unresolved anyway. So rest assured that I'm posting this question after several days of (failed) solution-hunting.
I'm trying to write a job that moves data (once a day) from MySQL tables to Hive tables stored as Parquet / ORC files on Amazon S3. Some of the tables are quite big: ~300M records with 200 GB+ size (as reported by phpMyAdmin).
Currently we are using sqoop for this, but we want to move to Spark for the following reasons:
- To leverage its capabilities with the DataFrame API (in the future, we would be performing transformations while moving the data); a rough sketch of the intended job follows this list
- We already have a sizeable framework written in Scala for Spark jobs used elsewhere in the organization
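For context, here is a minimal sketch of what the daily move would look like with the DataFrame API. The host, database, credentials, table names, S3 path and the added `load_date` column are placeholders/assumptions for illustration, not our actual setup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

object MySqlToHiveDaily {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mysql-to-hive-daily")   // hypothetical app name
      .enableHiveSupport()
      .getOrCreate()

    // Read one MySQL table over JDBC (host, db and credentials are placeholders)
    val src = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://<mysql-host>:3306/<db>")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "some_table")
      .option("user", "<user>")
      .option("password", "<password>")
      .load()

    // The kind of lightweight transformation we plan to add later
    val out = src.withColumn("load_date", current_date())

    // Write as Parquet on S3 and register it as a Hive table
    out.write
      .mode("overwrite")
      .format("parquet")
      .option("path", "s3://<bucket>/warehouse/some_table")  // hypothetical path
      .saveAsTable("warehouse_db.some_table")                // hypothetical Hive table
  }
}
```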
I've been able to achieve this on small MySQL tables without any issue. But the Spark job (that reads data from MySQL into a DataFrame) fails if I try to fetch more than ~1.5-2M records at a time. I've shown the relevant portions of the stack-trace below; you can find the complete stack-trace here.
...
javax.servlet.ServletException: java.util.NoSuchElementException: None.get
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
...
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
...
org.apache.spark.status.api.v1.OneStageResource.taskSummary(OneStageResource.scala:62)
at sun.reflect.GeneratedMethodAccessor188.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
...
[Stage 27:> (0 + 30) / 32]18/03/01 01:29:09 WARN TaskSetManager: Lost task 3.0 in stage 27.0 (TID 92, ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal, executor 6): java.sql.SQLException: Incorrect key file for table '/rdsdbdata/tmp/#sql_14ae_5.MYI'; try to repair it
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
...
** This stack-trace was obtained upon failure while moving a 148 GB table containing 186M records
As apparent from the (full) stack-trace, the Spark read job first emits spurious None.get warnings, followed by SQLException: Incorrect key file for table... (which is related to MySQL's tmp table space becoming full).
Now clearly this can't be a MySQL problem, because in that case sqoop should fail as well. As far as Spark is concerned, I'm parallelizing the read operation by setting numPartitions = 32 (we use a parallelism of 40 with sqoop).
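For reference, the partitioned read is done roughly like the sketch below; the partition column, bounds and fetchsize are assumptions for illustration (in reality the bounds would be computed from the table's numeric primary key beforehand):

```scala
// Partitioned JDBC read: Spark issues 32 parallel queries, each covering one
// range of the partition column, instead of a single full-table scan.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://<mysql-host>:3306/<db>")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "big_table")            // hypothetical table name
  .option("user", "<user>")
  .option("password", "<password>")
  .option("partitionColumn", "id")           // assumed numeric, indexed primary key
  .option("lowerBound", "1")                 // assumed min(id)
  .option("upperBound", "186000000")         // assumed ~max(id)
  .option("numPartitions", "32")
  .option("fetchsize", "10000")              // JDBC fetch-size hint (driver-dependent)
  .load()
```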
From my limited knowledge of Spark and Big Data, 148 GB shouldn't be overwhelming for Spark by any measure. Moreover, since MySQL, Spark (EMR) and S3 all reside in the same region (AWS AP-SouthEast), latency shouldn't be the bottleneck.
My questions are:
- Is Spark a suitable tool for this?
- Could Spark's JDBC driver be blamed for this issue?
- If the answer to the above question is
  - Yes: How can I overcome it (an alternate driver, or some other workaround)?
  - No: What could be the possible cause?
Framework Configurations:
- Hadoop distribution: Amazon 2.8.3
- Spark 2.2.1
- Hive 2.3.2
- Scala 2.11.11
EMR Configurations:
- EMR 5.12.0
- 1 Master: r3.xlarge [8 vCore, 30.5 GiB memory, 80 GB SSD storage, EBS storage: 32 GiB]
- 1 Task: r3.xlarge [8 vCore, 30.5 GiB memory, 80 GB SSD storage, EBS storage: none]
- 1 Core: r3.xlarge [8 vCore, 30.5 GiB memory, 80 GB SSD storage, EBS storage: 32 GiB]
** These are the configurations of the development cluster; the production cluster would be better equipped.