I have a Spark program that builds the network of France (cities, local authorities...) into a dataset for a given year. That dataset is then used for other operations: local accounting, searching among enterprises, etc.
In terms of business rules, the dataset is rather hard to create: many filters, many kinds of checks, and I don't know in advance how the caller who asks for it will use it. But most of the time, the caller asks for the dataset of the year 2019, because all they need is "all the cities existing in France today".
My program below succeeds in returning results for 2019.
But when the next caller also asks for the cities of 2019, Spark redoes all the work it did before...
What is the optimization principle here?
Should I store in my program, at the same level where I keep the SparkSession I use for queries and building, something like a Map<Integer, Dataset> whose key is the year and whose value is the dataset that at least one caller has already asked for that year?
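To make the idea concrete, here is a minimal sketch of what I have in mind (the class name, the method names, and the body of `buildCitiesDataset` are only placeholders for my real code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class FrenchNetworkService {
    private final SparkSession session;

    // One dataset per requested year, shared by every caller.
    private final Map<Integer, Dataset<Row>> citiesByYear = new ConcurrentHashMap<>();

    public FrenchNetworkService(SparkSession session) {
        this.session = session;
    }

    public Dataset<Row> citiesOfYear(int year) {
        // computeIfAbsent builds the dataset only on the first request for a year.
        // persist() tells Spark to keep the computed partitions in memory
        // (spilling to disk if needed), so later actions reuse them instead of
        // replaying the whole lineage of filters and checks.
        return citiesByYear.computeIfAbsent(year, y ->
                buildCitiesDataset(y).persist(StorageLevel.MEMORY_AND_DISK()));
    }

    // Placeholder for the real, expensive construction with all its
    // business-rule filters and checks.
    private Dataset<Row> buildCitiesDataset(int year) {
        return session.sql("SELECT * FROM cities WHERE year = " + year);
    }
}
```

Note that persist() is lazy: nothing is materialized until the first action runs on the dataset; the map only guarantees that every caller for a given year gets the same persisted dataset instead of a freshly rebuilt one.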