Here is the problem I want to solve: given a dataset as input, I want to generate a list of datasets. The input dataset is split into sub-datasets according to the Min or Max value that a chosen attribute takes in a second dataset. Here is an example of what I want, taking Flight as the attribute and the following two datasets:
1)
TicketId | Flight | time |
---------------------------------------|
10 | 123 | 2020-11-27 05:48:02|
---------------------------------------|
155 | 125 | 2020-11-27 05:49:02|
---------------------------------------|
12 | 133 | 2020-11-27 05:50:02|
---------------------------------------|
200 | 13 | 2020-11-27 06:49:02|
---------------------------------------|
123 | 22 | 2020-11-27 06:50:02|
---------------------------------------|
15 | 92 | 2020-11-27 05:51:02|
---------------------------------------|
21 | 41 | 2020-11-27 05:49:02|
---------------------------------------|
22 | 27 | 2020-11-27 05:50:02|
---------------------------------------|
422 | 35 | 2020-11-27 05:51:02|
---------------------------------------
The second dataset is the following:
2)
TicketId | Flight | time |
---------------------------------------|
103 | 156 | 2020-11-27 05:48:02|
---------------------------------------|
154 | 130 | 2020-11-27 05:49:02|
---------------------------------------|
123 | 151 | 2020-11-27 05:50:02|
---------------------------------------|
220 | 119 | 2020-11-27 06:49:02|
---------------------------------------|
143 | 111 | 2020-11-27 06:50:02|
---------------------------------------|
16 | 189 | 2020-11-27 05:51:02|
---------------------------------------|
22 | 152 | 2020-11-27 05:49:02|
---------------------------------------|
22 | 125 | 2020-11-27 05:50:02|
---------------------------------------|
134 | 187 | 2020-11-27 05:51:02|
---------------------------------------
Since the Min value of the Flight attribute in dataset 2 is 111, the list of datasets resulting from partitioning dataset 1 would be:
TicketId | Flight | time |
---------------------------------------|
10 | 123 | 2020-11-27 05:48:02|
---------------------------------------|
155 | 125 | 2020-11-27 05:49:02|
---------------------------------------|
12 | 133 | 2020-11-27 05:50:02|
---------------------------------------
and
TicketId | Flight | time |
---------------------------------------|
200 | 13 | 2020-11-27 06:49:02|
---------------------------------------|
123 | 22 | 2020-11-27 06:50:02|
---------------------------------------|
15 | 92 | 2020-11-27 05:51:02|
---------------------------------------|
21 | 41 | 2020-11-27 05:49:02|
---------------------------------------|
22 | 27 | 2020-11-27 05:50:02|
---------------------------------------|
422 | 35 | 2020-11-27 05:51:02|
---------------------------------------
The Min value of dataset 2 (111) splits dataset 1 into these two datasets: the rows whose Flight is above 111 and the rows whose Flight is below it. My question is how to achieve this in Spark with Java (or even Scala). NB: the partitioning value of the Flight attribute could just as well have been the Max value of that attribute in dataset 2.
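To make the expected result concrete, here is a minimal plain-Java sketch of the split (no Spark involved, so it is only an illustration of the semantics, not a solution). The `Ticket` class and the `minFlight`, `splitAt` and `flights` helpers are hypothetical names made up for this example. It also assumes that a row whose Flight equals the bound goes into the upper group, which the example data does not actually pin down.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FlightSplit {

    // Hypothetical row type mirroring the three columns of the example tables.
    public static final class Ticket {
        public final int ticketId;
        public final int flight;
        public final String time;

        public Ticket(int ticketId, int flight, String time) {
            this.ticketId = ticketId;
            this.flight = flight;
            this.time = time;
        }
    }

    // Smallest Flight value found in the reference dataset (dataset 2).
    public static int minFlight(List<Ticket> reference) {
        return reference.stream()
                .mapToInt(t -> t.flight)
                .min()
                .orElseThrow(() -> new IllegalArgumentException("empty reference dataset"));
    }

    // Splits the input around the bound.
    // Key true: rows with Flight >= bound (assumption: ties go to the upper group).
    // Key false: rows with Flight < bound.
    public static Map<Boolean, List<Ticket>> splitAt(List<Ticket> input, int bound) {
        return input.stream().collect(Collectors.partitioningBy(t -> t.flight >= bound));
    }

    // Extracts the Flight column, only used to print compact results.
    public static List<Integer> flights(List<Ticket> rows) {
        return rows.stream().map(t -> t.flight).collect(Collectors.toList());
    }

    // Dataset 1 from the example (the one to be partitioned).
    public static List<Ticket> dataset1() {
        return Arrays.asList(
                new Ticket(10, 123, "2020-11-27 05:48:02"),
                new Ticket(155, 125, "2020-11-27 05:49:02"),
                new Ticket(12, 133, "2020-11-27 05:50:02"),
                new Ticket(200, 13, "2020-11-27 06:49:02"),
                new Ticket(123, 22, "2020-11-27 06:50:02"),
                new Ticket(15, 92, "2020-11-27 05:51:02"),
                new Ticket(21, 41, "2020-11-27 05:49:02"),
                new Ticket(22, 27, "2020-11-27 05:50:02"),
                new Ticket(422, 35, "2020-11-27 05:51:02"));
    }

    // Dataset 2 from the example (the one that provides the bound).
    public static List<Ticket> dataset2() {
        return Arrays.asList(
                new Ticket(103, 156, "2020-11-27 05:48:02"),
                new Ticket(154, 130, "2020-11-27 05:49:02"),
                new Ticket(123, 151, "2020-11-27 05:50:02"),
                new Ticket(220, 119, "2020-11-27 06:49:02"),
                new Ticket(143, 111, "2020-11-27 06:50:02"),
                new Ticket(16, 189, "2020-11-27 05:51:02"),
                new Ticket(22, 152, "2020-11-27 05:49:02"),
                new Ticket(22, 125, "2020-11-27 05:50:02"),
                new Ticket(134, 187, "2020-11-27 05:51:02"));
    }

    public static void main(String[] args) {
        int bound = minFlight(dataset2());
        Map<Boolean, List<Ticket>> parts = splitAt(dataset1(), bound);

        System.out.println("bound = " + bound);                               // bound = 111
        System.out.println("Flight >= bound: " + flights(parts.get(true)));   // [123, 125, 133]
        System.out.println("Flight <  bound: " + flights(parts.get(false)));  // [13, 22, 92, 41, 27, 35]
    }
}
```

What I am looking for is the equivalent of `minFlight` and `splitAt` on a Spark `Dataset`, ideally one that works the same way when the bound is the Max of dataset 2 instead of the Min.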
Thanks for the help.