My prototype (written in R with the packages dplyr and tidyr) is hitting a wall in terms of computational complexity, even on my powerful workstation. Therefore, I want to port the code to Spark using Scala.
I looked up all transformations, actions, functions (SparkSQL) and column operations (also SparkSQL) and found equivalents for every function except tidyr::spread(), which is available in R.
df %>% tidyr::spread(key = COL_KEY , value = COL_VAL)
basically spreads a key-value pair across multiple columns. E.g. the table
COL_KEY | COL_VAL
-----------------
A | 1
B | 1
A | 2
will be transformed to
A | B
------------
1 | 0
0 | 1
2 | 1
In case there is no out-of-the-box solution available, could you point me in the right direction? Maybe a user-defined function?
I'm free to choose which Spark (and Scala) version to use, so I'd go for the latest, 2.0.0.
Thanks!