I am trying to export a large MySQL table with 350M records to Parquet files in S3.

This is the code I tried:

    df = sparkSession.read.format('jdbc').options(
        url=db_url,
        driver='com.mysql.jdbc.Driver',
        dbtable='table_name',
        user=db_user,
        password=db_pwd,
        partitioncolumn='id',
        lowerbound=0,
        upperbound=1000000,
        numpartitions=10
    ).load()

    df.write.parquet(output_path, mode='overwrite')

It runs for 25 minutes on an EMR cluster of r5.2xlarge instances (1 master, 10 core, and 10 task nodes) and then fails with the error `com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure`.

Earlier I tried without the numpartitions, lowerbound, upperbound, and partitioncolumn options and got the same error. Based on similar issues reported on Stack Overflow, I added those options, but the error persists.
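For context, my understanding is that Spark uses lowerbound and upperbound only to compute the partition stride, not to filter rows, so the first and last partitions are open-ended. Below is a rough sketch of the ranges I would expect my settings to produce. The helper `partition_predicates` is a hypothetical name of my own, the logic is a simplification rather than Spark's actual code, and it assumes at least two partitions.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Approximate the per-partition WHERE clauses for a JDBC read.

    Simplified illustration only (hypothetical helper, not Spark's code).
    Assumes num_partitions >= 2.
    """
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower + i * stride
        end = start + stride
        if i == 0:
            # First partition is open-ended below and also picks up NULLs.
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended above: everything from start on.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


# Same values as in my read options above.
preds = partition_predicates("id", 0, 1000000, 10)
for p in preds:
    print(p)
```

If the ids in my table go well beyond 1,000,000, which seems likely with 350M rows, the last range (`id >= 900000`) would cover almost the whole table in a single connection. I am not sure whether that is related to the error.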

Any help would be highly appreciated.

Jacek Laskowski
sufinsha
