I am using the SQL API of Spark 2.0.0.
I would like to know what is the good practice when I have two independant actions that have to be done on my data. Here is a basic example :
val ds = sc.parallelize(List(
("2018-12-07T15:31:48Z", "AAA",3),
("2018-12-07T15:32:48Z", "AAA",25),
("2018-12-07T15:33:48Z", "AAA",20),
("2018-12-07T15:34:48Z", "AAA",10),
("2018-12-07T15:35:48Z", "AAA",15),
("2018-12-07T15:36:48Z", "AAA",16),
("2018-12-07T15:37:48Z", "AAA",8),
("2018-12-07T15:31:48Z", "BBB",15),
("2018-12-07T15:32:48Z", "BBB",0),
("2018-12-07T15:33:48Z", "BBB",0),
("2018-12-07T15:34:48Z", "BBB",1),
("2018-12-07T15:35:48Z", "BBB",8),
("2018-12-07T15:36:48Z", "BBB",7),
("2018-12-07T15:37:48Z", "BBB",6)
)).toDF("timestamp","tag","value")
val newDs = commonTransformation(ds).cache();
newDs.count() // force computation of the dataset
val dsAAA = newDs.filter($"tag"==="AAA")
val dsBBB = newDs.filter($"tag"==="BBB")
actionAAA(dsAAA)
actionBBB(dsBBB)
Using the following functions :
def commonTransformation(ds:Dataset[Row]):Dataset[Row]={
ds // do multiple transformations on dataframe
}
def actionAAA(ds:Dataset[Row]){
Thread.sleep(5000) // Sleep to simulate an action that takes time
ds.show()
}
def actionBBB(ds:Dataset[Row]){
Thread.sleep(5000) // Sleep to simulate an action that takes time
ds.show()
}
In this example, we have an input dataset that contains multiple time series identified by the 'tag' column. Some transofrmations are applied to this whole dataset.
Then, I want to apply different actions depending of the tag of the time series on my data.
In my example, I get the expected result, but I had to wait a long time for both my actions to get executed, event though I had executors available.
I partialy solved the problem by using Java class Future, which allows me to start my actions in an asynchronous way. But with this solution, Spark become very slow if I start too much actions compared to his resources and end up taking more time than doing the actions one by one.
So for now, my solution is to start multiple actions with a maximum limit of actions running at the same time, but I don't feel like it is the good way to do (and the maximum limit is hard to guess).