I have a Spark DataFrame and am using sparklyr. I want to use functions such as `n_distinct` (available in dplyr) and `match` (e.g. to find the index of element e of column x in column y). I understand that these don't fit naturally with the idea of parallel computing: if the different portions of the DataFrame are processed separately, it's hard to apply functions such as `n_distinct` and `match`.
But I have a variable called `group` which defines groups, and it's only within these groups that I want to use `n_distinct` and `match`. So if I could find a way to tell Spark to allocate the rows of each group to the same partition (is "partition" the right word?) and to apply the functions within the groups, it could work.
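To make the desired behaviour concrete, here is a minimal sketch with plain dplyr on a local data frame (the column names `x`, `y`, and `group` and the values are made up for illustration); this is the grouped result I'd like to reproduce on the Spark side:

```r
library(dplyr)

# toy data: two groups, with columns x and y to compare within each group
df <- tibble(
  group = c("a", "a", "a", "b", "b"),
  x     = c(1, 2, 2, 3, 3),
  y     = c(2, 1, 3, 3, 4)
)

df %>%
  group_by(group) %>%
  mutate(
    n_unique = n_distinct(x),  # number of distinct x values within the group
    idx      = match(x, y)     # position of each x in y, within the group only
  )
```

This runs fine locally; my question is whether the same `group_by` + `mutate` pattern (or some equivalent) can be made to work on a sparklyr DataFrame.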
Is it possible to do such a thing? Thank you for the help!