I currently have an application that is supposed to connect to different types of databases, run a specific query on each database using Spark's JDBC options, and then write the resulting DataFrame to HDFS.
The performance was extremely bad for Oracle (I didn't check all of the databases). It turned out to be due to the fetchSize property, which defaults to 10 rows for Oracle. I increased it to 1000 and the performance gain was quite visible. Then I changed it to 10000, but some of the tables started failing with an out-of-memory error in the executors (6 executors, 4G memory each, 2G driver memory).
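For reference, the read path looks roughly like this. This is a minimal sketch; the URL, table name, credentials, and output path are placeholders, not my actual configuration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("JdbcToHdfs")
  .getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder connection URL
  .option("dbtable", "(SELECT * FROM some_table) t")     // the query, pushed down as a subquery
  .option("user", "dbUser")                              // placeholder credentials
  .option("password", "dbPassword")
  .option("fetchsize", "1000")                           // rows fetched per round trip; Oracle's driver defaults to 10
  .load()

df.write.parquet("hdfs:///data/some_table")              // placeholder HDFS output path
```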
My questions are:

1. Is the data fetched by Spark's JDBC persisted in executor memory for each run? Is there any way to un-persist it while the job is running?
2. Where can I get more information about the fetchSize property? I'm guessing it won't be supported by all JDBC drivers.
3. Are there any other JDBC-related things I need to take care of to avoid OOM errors?