Spark avoid execution of the entire query each time

Question

I have a query that computes a moving average since the beginning of time on data stored in a MySQL database. I need to run that query every day, using the previous day's value.

Instead of querying my database every time, I am using a checkpoint to store the latest date computed so far. I then restore the checkpoint to get the dataframe, but I get back all the data I used before, including the latest date, stored in a dataframe.

I just need a way to avoid re-executing my query on the whole MySQL database and instead use only the latest date's input. Is that doable and recommended in Spark?

df.checkpoint
RecoverCheckpoint.recover

I do not know if that is a good method since checkpoint is used for fault tolerance. Is there another way to achieve this?

Ref:

Spark Checkpointing Non-Streaming - Checkpoint files can be used in subsequent job run or driver program

score 0 · Accepted Answer · answered Oct 08 '20 at 19:11

You may find this useful: https://dzone.com/articles/what-are-spark-checkpoints-on-dataframes. As you will discover, checkpointing is also a necessary aspect of iterative algorithms. There are some odd things to get to grips with.

In all honesty, I would simply re-query and keep things simple. You refer to my original question; the answer there is good, but there is no way I would implement it. You can see the issue you are running into yourself.

answered Oct 08 '20 at 19:11

thebluephantom

Would you say that storing today's date in a DB (instead of a checkpoint) and then running the SQL query based on that date would be a good idea? – Brend Oct 09 '20 at 03:49
Easier. And if you master that answer and then leave, some other person will need to get up to speed. Maybe bucketBy is a better option. – thebluephantom Oct 09 '20 at 06:29
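
For anyone wanting to try the "store the date in a DB" idea from the comment above, here is a minimal sketch of one way it could look. It is not from the thread and makes several assumptions: the "moving average since the beginning of time" is treated as a plain cumulative average, `sqlite3` stands in for MySQL so the example runs on its own, and the table names `measurements` and `avg_state` and the function `update_running_average` are hypothetical. The stored state is the last processed date plus a running count and sum, so each daily run only reads rows newer than that date.

```python
import sqlite3


def update_running_average(conn):
    """Fold only rows newer than the stored date into the running average."""
    cur = conn.cursor()
    # Hypothetical state table: one row holding the last date, count, and sum.
    last_date, count, total = cur.execute(
        "SELECT last_date, row_count, total FROM avg_state"
    ).fetchone()

    # Read only the rows after the stored date instead of the whole table.
    new_count, new_total, new_last = cur.execute(
        "SELECT COUNT(*), COALESCE(SUM(value), 0.0), MAX(day) "
        "FROM measurements WHERE day > ?",
        (last_date,),
    ).fetchone()

    if new_count:
        count += new_count
        total += new_total
        last_date = new_last
        cur.execute(
            "UPDATE avg_state SET last_date = ?, row_count = ?, total = ?",
            (last_date, count, total),
        )
        conn.commit()

    average = total / count if count else None
    return average, last_date


# In-memory stand-in for the real database (assumption: ISO date strings,
# which compare correctly as text).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (day TEXT, value REAL)")
conn.execute("CREATE TABLE avg_state (last_date TEXT, row_count INTEGER, total REAL)")
conn.execute("INSERT INTO avg_state VALUES ('', 0, 0.0)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("2020-10-06", 10.0), ("2020-10-07", 20.0)],
)

first_avg, first_date = update_running_average(conn)    # first run reads both rows

conn.execute("INSERT INTO measurements VALUES ('2020-10-08', 30.0)")
second_avg, second_date = update_running_average(conn)  # second run reads one row
```

In a Spark job the same `WHERE day > <stored date>` filter could presumably be pushed to MySQL through the JDBC source, so only the new rows are loaded; the state table plays the role the checkpoint was being used for.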
