Spark data pipeline initial load impact on production DB

Question

I want to write a Spark pipeline that aggregates my production DB data and writes the results back to the DB. The goal is to run the aggregation without impacting the production DB: I don't want users to experience lag, or the DB to see heavy IOPS, while the aggregation runs. For example, the equivalent aggregation run directly as a SQL query takes a long time and uses up the RDS IOPS, which leaves users unable to get data. That is what I'm trying to avoid. A few questions:

How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)? For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
When writing data back to DB, does that incur load on DB as well?

I'm using a PostgreSQL database in case this matters.

score 0 · Answer 1 · answered Sep 12 '22 at 03:16

How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?

By default, Glue reads the whole table into a single partition. You can do parallel reads instead by partitioning the read on a column; make sure to choose a column that will not hurt DB performance.
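As a rough sketch of what a parallel read can look like in plain Spark (Glue has its own equivalent settings), the helper below builds the standard Spark JDBC options `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions` and `fetchsize`. The endpoint, the table `public.sales` and the column `id` are made-up placeholders, and credentials are left out; treat it as an illustration under those assumptions, not as the exact setup for your job.

```python
def parallel_read_options(url, table, column, lower, upper,
                          num_partitions, fetch_size=1000):
    """Build Spark JDBC options that split one table read into range partitions.

    lowerBound/upperBound only decide how the column range is sliced into
    partitions; they do not filter rows out of the result.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,       # numeric, date or timestamp column
        "lowerBound": str(lower),
        "upperBound": str(upper),
        # Each partition opens its own connection, so this also caps
        # how many concurrent queries hit the database.
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),    # rows fetched per round trip
    }


def read_table(spark, options):
    """Load the table with an existing SparkSession (not created here)."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical endpoint, table and column names, for illustration only.
opts = parallel_read_options(
    "jdbc:postgresql://replica.example.internal:5432/shop",
    "public.sales", "id", 1, 1_000_000, num_partitions=4,
)
```

Keeping `numPartitions` small is the main lever here: more partitions finish the read sooner but put more simultaneous queries on the database.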

Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)?

Yes. When you pass a query instead of a table name, you read only the result of that query from the DB, which cuts down the network and I/O transfer. It also means you are delegating the work of computing that result to the DB engine.
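To make the "pass a query instead of a table" idea concrete, here is a minimal sketch that wraps a 30-day filter in a subquery and hands it to Spark's JDBC reader through the `dbtable` option, so PostgreSQL does the filtering and only matching rows are transferred. The table `public.sales` and column `sold_at` are assumed names, not taken from your schema.

```python
def last_n_days_subquery(table, ts_column, days, alias="recent_rows"):
    """Return a parenthesised subquery usable as the JDBC `dbtable` option.

    The WHERE clause is evaluated by PostgreSQL, so only the last `days`
    days of rows leave the database.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return (
        f"(SELECT * FROM {table} "
        f"WHERE {ts_column} >= CURRENT_DATE - INTERVAL '{int(days)} days') "
        f"AS {alias}"
    )


def read_recent(spark, url, subquery):
    """Read only the subquery result using an existing SparkSession."""
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", subquery)
        .load()
    )


# Hypothetical table and column names, for illustration only.
subquery = last_n_days_subquery("public.sales", "sold_at", 30)
```

Spark also has a `query` option for the same purpose, but it cannot be combined with `partitionColumn`, which is why this sketch uses an aliased subquery in `dbtable`.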

For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?

Yes. Depending on the table size and query complexity, this can affect DB performance. If you have a read replica, you can simply point the job at that instead.

When writing data back to DB, does that incur load on DB as well?

Yes, and how much depends on how you write the result back to the DB. A moderate number of partitions is best: not too many and not too few.
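One possible way to keep the partition count in check on the write side is sketched below. It assumes plain Spark's JDBC writer, where each partition writes over its own connection, and uses `coalesce` plus the `batchsize` option; the endpoint, the target table `public.sales_agg` and the numbers are placeholders to tune against your own database.

```python
def write_options(url, table, batch_size=1000):
    """Build Spark JDBC write options; batchsize is rows per insert batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {"url": url, "dbtable": table, "batchsize": str(batch_size)}


def write_back(df, options, max_connections=4):
    """Append df to the target table using at most max_connections partitions.

    coalesce() only ever reduces the partition count, so a DataFrame that
    already has fewer partitions is left as it is.
    """
    (
        df.coalesce(max_connections)
        .write.format("jdbc")
        .options(**options)
        .mode("append")
        .save()
    )


# Hypothetical endpoint and table names, for illustration only.
w_opts = write_options(
    "jdbc:postgresql://primary.example.internal:5432/shop",
    "public.sales_agg", batch_size=500,
)
```

Since the aggregated result is usually far smaller than the input, a handful of partitions is often enough and keeps the number of simultaneous inserts low.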
