I have a Spark DataFrame and am using sparklyr. I want to use functions such as `n_distinct` (available in dplyr) and `match` (e.g. to find the index of element e of column x in column y). I understand that these don't fit naturally with the idea of parallel computing: if the different portions of the DataFrame are processed separately, it's hard to apply functions such as `n_distinct` and `match`.
But I have a variable called `group` which defines groups, and it's only within these groups that I want to use `n_distinct` and `match`. So if I could find a way to tell Spark to allocate the rows of each group to the same partition (is "partition" the right word?) and to apply the functions within the groups, it could work.
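To make the desired behaviour concrete, here is a minimal sketch with plain dplyr on a local data frame (the column names `x`, `y`, and `group` and the values are made up for illustration); this is the grouped result I'd like to reproduce on the Spark side:

```r
library(dplyr)

# toy data: two groups, with columns x and y to compare within each group
df <- tibble(
  group = c("a", "a", "a", "b", "b"),
  x     = c(1, 2, 2, 3, 3),
  y     = c(2, 1, 3, 3, 4)
)

df %>%
  group_by(group) %>%
  mutate(
    n_unique = n_distinct(x),  # number of distinct x values within the group
    idx      = match(x, y)     # position of each x in y, within the group only
  )
```

This runs fine locally; my question is whether the same `group_by` + `mutate` pattern (or some equivalent) can be made to work on a sparklyr DataFrame.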
Is it possible to do such a thing? Thank you for the help!