I'm currently trying to bulk-migrate the contents of a very large MySQL table into a Parquet file via Spark SQL, but I quickly run out of memory, even when I raise the driver's memory limit (I'm running Spark in local mode). Example code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// spark is a local-mode SparkSession; url is the usual MySQL JDBC URL.
Dataset<Row> ds = spark.read()
        .format("jdbc")
        .option("url", url)
        .option("driver", "com.mysql.jdbc.Driver")
        .option("dbtable", "bigdatatable")
        .option("user", "root")
        .option("password", "foobar")
        .load();

// Write the whole Dataset out as Parquet.
ds.write().mode(SaveMode.Append).parquet("data/bigdatatable");
It seems like Spark tries to read the entire table into memory before writing anything, which is never going to work for a table this size. So, what's the best approach to doing bulk data migration via Spark SQL?
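From the JDBC data source documentation I suspect the fix involves the partitioning options (partitionColumn, lowerBound, upperBound, numPartitions), and maybe fetchsize, so that the table is read as many bounded range queries instead of one giant result set. Below is a rough sketch of what I mean; the id column, the bound values and the partition count are placeholders for my actual schema, so treat it as a guess rather than working code:

// Sketch of a partitioned read (same imports and SparkSession as above).
// "id", the bounds and the partition count are placeholders for my real schema.
Dataset<Row> partitioned = spark.read()
        .format("jdbc")
        .option("url", url)
        .option("driver", "com.mysql.jdbc.Driver")
        .option("dbtable", "bigdatatable")
        .option("user", "root")
        .option("password", "foobar")
        .option("partitionColumn", "id")    // numeric column Spark splits on
        .option("lowerBound", "1")          // roughly MIN(id)
        .option("upperBound", "100000000")  // roughly MAX(id)
        .option("numPartitions", "100")     // 100 independent range queries
        .option("fetchsize", "10000")       // rows per fetch; MySQL may also need useCursorFetch=true on the URL, I'm not sure
        .load();

partitioned.write().mode(SaveMode.Append).parquet("data/bigdatatable");

If this is on the right track, guidance on choosing numPartitions and the bounds would also help.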