Backpressure is just a fancy term for capping the maximum receiving rate, so it doesn't work the way you seem to expect. What actually needs tuning here is the reading end.
In classical JDBC usage, connectors expose a fetchSize property on PreparedStatements, so you can start by configuring that fetchSize along the lines of the following answers:
Unfortunately, this alone may not solve all of your performance issues with your RDBMS.
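As a minimal sketch of what that looks like when reading through Spark's JDBC source (the fetchsize option is forwarded to the underlying statement), assuming a PostgreSQL database; the URL, table, and credentials are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-fetchsize").getOrCreate()

// Placeholder connection details -- adjust to your environment.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.events")
  .option("user", "reader")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  // Rows fetched per round trip on the underlying statement; driver
  // defaults are often tiny, or 0 ("fetch everything") for PostgreSQL.
  .option("fetchsize", "1000")
  .load()
```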
Keep in mind that, compared to the basic JDBC reader, which runs on a single worker, partitioning the data by an integer column or by a sequence of predicates loads it in distributed mode but introduces a couple of problems. In your case, a high number of concurrent reads can easily throttle the database.
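For illustration, here is a minimal sketch of both partitioned variants; the URL, table, column names, and bounds are placeholders:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned").getOrCreate()
val url = "jdbc:postgresql://db-host:5432/mydb"  // placeholder

// Variant 1: partition on an integer column. Spark splits the
// [lowerBound, upperBound] range into numPartitions strides (the bounds do
// NOT filter rows). numPartitions also caps the number of concurrent
// connections, so keep it low enough not to throttle the database.
val byColumn = spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "public.events")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// Variant 2: one partition per predicate; each string becomes a WHERE clause.
val props = new Properties()
val predicates = Array("region = 'us'", "region = 'eu'", "region = 'apac'")
val byPredicates = spark.read.jdbc(url, "public.events", predicates, props)
```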
To deal with these problems, I suggest the following:
- If available, consider using specialized data sources over JDBC connections.
- Consider using specialized or generic bulk import/export tools like Postgres COPY or Apache Sqoop (see the sketch after this list).
- Be sure to understand the performance implications of the different JDBC data source variants, especially when working with a production database.
- Consider using a separate replica for Spark jobs.
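As an illustration of the COPY route mentioned above, here is a minimal sketch using the PostgreSQL JDBC driver's CopyManager; the connection details, table, and output path are placeholders:

```scala
import java.io.FileWriter
import java.sql.DriverManager
import org.postgresql.PGConnection

// Placeholder connection details -- adjust to your environment.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://db-host:5432/mydb", "reader", sys.env.getOrElse("DB_PASSWORD", ""))
try {
  // The PostgreSQL driver exposes server-side COPY through CopyManager,
  // which streams the table out much faster than row-by-row ResultSet reads.
  val copyApi = conn.unwrap(classOf[PGConnection]).getCopyAPI
  val out = new FileWriter("/tmp/events.csv")  // placeholder output path
  try {
    val rows = copyApi.copyOut("COPY public.events TO STDOUT WITH (FORMAT csv)", out)
    println(s"Exported $rows rows")
  } finally out.close()
} finally conn.close()
```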
If you wish to know more about reading data with the JDBC source, I suggest you read the following:
Disclaimer: I'm the co-author of that repo.