I am trying to export a large MySQL table with 350M records to Parquet files in S3.
This is the code I tried:
df = sparkSession.read.format('jdbc').options(
    url=db_url,
    driver='com.mysql.jdbc.Driver',
    dbtable='table_name',
    user=db_user,
    password=db_pwd,
    partitioncolumn='id',
    lowerbound=0,
    upperbound=1000000,
    numpartitions=10
).load()
df.write.parquet(output_path, mode='overwrite')
It runs for about 25 minutes on an EMR cluster with r5.2xlarge instances (1 Master, 10 Core and 10 Task) and then fails with the error com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure.
Earlier I tried without the numpartitions, lowerbound, upperbound and partitioncolumn options and got the same error. Based on similar issues reported on Stack Overflow, I added those options, but the error still occurs.
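For reference, my understanding from the Spark JDBC documentation is that lowerbound and upperbound only set the partition stride and do not filter rows. Below is a minimal sketch of how the per-partition WHERE clauses would be derived from the options above. It is a simplified approximation in plain Python, not Spark's actual code, and the function name jdbc_partition_predicates is hypothetical:

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the WHERE clause each JDBC partition would use.

    Hypothetical helper for illustration only; assumes integer bounds
    and num_partitions >= 2.
    """
    # The bounds only determine the stride; they are not used as filters.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


# Same values as in my read options above.
preds = jdbc_partition_predicates("id", 0, 1000000, 10)
print(preds[0])   # id < 100000 OR id IS NULL
print(preds[-1])  # id >= 900000
```

If this approximation is right and the id values in the table extend well beyond 1000000, the last partition (id >= 900000) would still read most of the table over a single connection, so I am not sure the partitioning is actually helping here.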
Any help would be highly appreciated.