Count the number of unique elements in each column with dplyr in sparklyr

Question

I'm trying to count the number of unique elements in each column in the spark dataset s.

However, it seems that Spark doesn't recognize tally():

    k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(tally(distinct(.)))))
    Error: org.apache.spark.sql.AnalysisException: undefined function TALLY

It seems that Spark doesn't recognize simple R functions either, like "unique" or "length". I can run the code on local data, but when I try to run the exact same code on a Spark table it doesn't work.

```
# Works on a local data frame
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d %>% group_by(group) %>% summarise_each(funs(length(unique(.))))
# A tibble: 2 × 3
#   group    X1    X2
#   <chr> <int> <int>
# 1     a     5     1
# 2     b     5     1

# Fails on the Spark table s
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(length(unique(.)))))
# Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```

Possible duplicate of [number of unique values sparklyr](https://stackoverflow.com/q/49538717/6910411). — zero323, Apr 19 '18 at 21:07

Pasqui · Answer 1 · 2018-04-19T22:56:51.753

1

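library(sparklyr)
library(dplyr)
#I am on Spark V. 2.1

#Connection (assumed local; the original answer does not show how sc was created)
sc <- spark_connect(master = "local")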

#Building example input (local)
d <- data.frame(cbind(seq(1, 10, 1), rep(1,10)))
d$group <- rep(c("a","b"), each = 5)
d

#Spark tbl 
sdf <- sparklyr::sdf_copy_to(sc, d)

# The Answer
sdf %>% 
    group_by(group) %>% 
    summarise_all(funs(n_distinct)) %>%
    collect()

#Output
  group    X1    X2
  <chr> <dbl> <dbl>
1     b     5     1
2     a     5     1

NB: Since we are using sparklyr, I went for dplyr::n_distinct(), which sparklyr can translate to Spark SQL. Minor: dplyr::summarise_each is deprecated, hence dplyr::summarise_all.
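Since this answer was written, `funs()` has also been deprecated (as of dplyr 0.8.0), and the scoped verbs such as `summarise_all()` have been superseded by `across()`. The following is a minimal sketch of the same per-group distinct count in the newer idiom. It assumes dplyr 1.0.0 or later, a dbplyr version that translates `across()`, and a local Spark installation; the table name `d_example` is made up for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Same example data as in the answer, with explicit column names.
d <- data.frame(
  X1 = seq(1, 10, 1),
  X2 = rep(1, 10),
  group = rep(c("a", "b"), each = 5)
)

# "d_example" is a hypothetical table name used only for this sketch.
sdf <- sdf_copy_to(sc, d, name = "d_example", overwrite = TRUE)

# across(everything(), ...) applies n_distinct to every non-grouping column.
sdf %>%
  group_by(group) %>%
  summarise(across(everything(), n_distinct)) %>%
  collect()

spark_disconnect(sc)
```

The result should have the same shape as the output above: one row per group and one distinct count per column.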

edited Apr 19 '18 at 22:56

answered Apr 19 '18 at 21:55

Pasqui


`summarise_each` is actually deprecated, with `summarise_all` being preferred since dplyr 0.5.0 – Zafar Apr 19 '18 at 22:19
1

@Zafar: thank you: my code was already correct, but I swapped the two in the final note. Now edited. – Pasqui Apr 19 '18 at 22:28
Thank you guys! I really appreciate it! – StatsBoy Apr 20 '18 at 13:23
@StatsBoy please consider accepting one of the answers – Pasqui Apr 20 '18 at 13:56

Zafar · Accepted Answer · 2018-04-19T22:17:17.193

-2

Remember that when you write sparklyr code you are really transpiling to Spark SQL, so you may need to use Spark SQL verbs from time to time. This is one of those times where Spark SQL verbs like count and distinct come in handy.

library(sparklyr)
library(dplyr)

# spark_connect() needs a master; "local" is assumed here
sc <- spark_connect(master = "local")
iris_spk <- copy_to(sc, iris)

# for instance this does not work in R, but it does in sparklyr
iris_spk %>%
  summarise(Species = distinct(Species))
# or
iris_spk %>%
  summarise(Species = approx_count_distinct(Species))

# this does what you are looking for
# (the grouping column is named Species, with a capital S)
iris_spk %>% 
    group_by(Species) %>%
    summarise_all(funs(n_distinct))

# for larger data sets this is much faster, but the counts are approximate
iris_spk %>% 
    group_by(Species) %>%
    summarise_all(funs(approx_count_distinct))
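One way to check the point about transpiling is to print the SQL that a pipeline is translated to, without running it. The sketch below uses `dplyr::show_query()` for this. It assumes a local Spark installation and relies on `copy_to()` renaming `Sepal.Length` to `Sepal_Length`; the result column name `n_sepal_length` is made up for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
iris_spk <- copy_to(sc, iris, overwrite = TRUE)

# Build the query lazily; nothing is executed on Spark yet.
q <- iris_spk %>%
  group_by(Species) %>%
  summarise(n_sepal_length = n_distinct(Sepal_Length))

# Print the generated Spark SQL instead of collecting the result.
show_query(q)

spark_disconnect(sc)
```

Functions that dplyr does not know how to translate, such as `approx_count_distinct()`, are passed through to Spark SQL as written, which is also why `unique()` and `tally()` in the question end up as "undefined function" errors.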

edited Apr 19 '18 at 22:17

answered Apr 19 '18 at 22:12

Zafar


Thank you Zafar! I appreciate it! – StatsBoy Apr 20 '18 at 13:23
as Pasqui said, good idea to mark one of these as best answer :) – Zafar Apr 20 '18 at 23:23
