I have a web service built around Spark that, based on a JSON request, builds a series of dataframe/dataset operations.
These operations involve multiple joins, filters, etc. that change the ordering of the values in the columns. The final dataset could contain millions of rows.
Preferably without converting it to an RDD, is there any way to apply custom sorts to some columns of the final dataset, based on the order of elements passed in as lists?
The original dataframe is of the form:
+----------+----------+
| Column 1 | Column 2 |
+----------+----------+
| Val 1    | val a    |
+----------+----------+
| Val 2    | val b    |
+----------+----------+
| val 3    | val c    |
+----------+----------+
After a series of transformations is performed, the dataframe ends up looking like this:
+----------+----------+----------+----------+
| Column 1 | Column 2 | Column 3 | Column 4 |
+----------+----------+----------+----------+
| Val 2    | val b    | val 999  | val 900  |
+----------+----------+----------+----------+
| Val 1    | val c    | val 100  | val 9$#@ |
+----------+----------+----------+----------+
| val 3    | val a    | val 2##  | val $#@8 |
+----------+----------+----------+----------+
I now need to sort on multiple columns, where the sort order for each column is given by the order of the values in a list passed in with the request.
For example:
Col1values Order=[val 1,val 3,val 2]
Col3values Order=[100,2##,999]
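To make the intended result concrete, here is a minimal plain-Python sketch of the ordering semantics I am after. It is not Spark code, only an illustration: each value is mapped to its position in the supplied list, and rows are sorted on those positions, column by column. The helper name `custom_sort`, the keys `col1`/`col3`, and the sample rows are hypothetical, and I am assuming that values missing from an order list should sort last.

```python
# Plain-Python illustration of the desired ordering (not Spark code).
# custom_sort, the column keys, and the sample rows are hypothetical.

def custom_sort(rows, orders):
    """Sort rows (dicts) by several columns, each with its own value order.

    orders: list of (column_name, ordered_values) pairs, highest priority first.
    Assumption: values missing from an order list sort after all listed values.
    """
    # Precompute value -> position lookups so each key lookup is O(1).
    ranks = [(col, {v: i for i, v in enumerate(values)}) for col, values in orders]

    def key(row):
        # One rank per sort column; unknown values get the largest rank.
        return tuple(rank.get(row[col], len(rank)) for col, rank in ranks)

    return sorted(rows, key=key)


rows = [
    {"col1": "val 2", "col3": "999"},
    {"col1": "val 1", "col3": "999"},
    {"col1": "val 3", "col3": "2##"},
    {"col1": "val 1", "col3": "100"},
]
orders = [
    ("col1", ["val 1", "val 3", "val 2"]),
    ("col3", ["100", "2##", "999"]),
]

result = custom_sort(rows, orders)
for row in result:
    print(row)
```

With these sample rows, the two "val 1" rows come first (with "100" before "999"), then "val 3", then "val 2". What I am looking for is the DataFrame/Dataset equivalent of this ranking that scales to millions of rows.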