I have a > 5GB table on mysql. I want to load that table on spark as a dataframe and create a parquet file out of it.
This is my python function to do the job:
def import_table(tablename):
spark = SparkSession.builder.appName(tablename).getOrCreate()
df = spark.read.format('jdbc').options(
url="jdbc:mysql://mysql.host.name:3306/dbname?zeroDateTimeBehavior=convertToNull
",
driver="com.mysql.jdbc.Driver",
dbtable=tablename,
user="root",
password="password"
).load()
df.write.parquet("/mnt/s3/parquet-store/%s.parquet" % tablename)
I am running the following script to run my spark app:
./bin/spark-submit ~/mysql2parquet.py --conf "spark.executor.memory=29g" --conf "spark.storage.memoryFraction=0.9" --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" --driver-memory 29G --executor-memory 29G
When I run this script on a EC2 instance with 30 GB, it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded
Meanwhile, I am only using 1.42 GB of total memory available.
Here is full console output with stack trace: https://gist.github.com/idlecool/5504c6e225fda146df269c4897790097
I am not sure if I am doing something wrong or spark is not meant for this use-case. I hope spark is.