1

I have the following data frame, called df:

ci ing de
21 20 100
22 19 0
23 NA 80
24 100 NA
25 NA 50
26 50 30

and I want to count the number of missing values in each column using Spark.

I know that in R, code like this would work:

apply(df, 2, FUN = function(x) sum(is.na(x)))

I want to do the same, but using Spark.

sparklyr has a function called spark_apply, but I can't figure out how to make it work.
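
For reference, a minimal sparklyr setup for this data might look like the following sketch; it assumes a local Spark installation and uses the sdf name that the answers below also use.

library(sparklyr)
library(dplyr)

# assumes a local Spark installation; adjust master as needed
sc <- spark_connect(master = "local")

df <- data.frame(
  ci  = 21:26,
  ing = c(20, 19, NA, 100, NA, 50),
  de  = c(100, 0, 80, NA, 50, 30)
)

# copy the local data frame into Spark
sdf <- copy_to(sc, df)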

TylerH
Joe
  • I would clarify what is meant by 'missings'. If you are asking how to count NULL values in Spark, there is a good post here on working with NULL: https://stackoverflow.com/questions/41533290/difference-between-null-and-isnull-in-spark-datadrame – jdev Jul 20 '17 at 17:01
  • Sorry but I don't understand – Joe Jul 20 '17 at 22:27

4 Answers

0

Here "na" checking in df...

scala> val nacount = df.count() - df.na.drop().count()
nacount: Long = 2000
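
That gives a row-level count, not nulls per column. For per-column counts you can also skip spark_apply entirely and let dplyr translate the aggregation to Spark SQL; a sketch, assuming the sdf table built from the question's data and that is.na() is translated to IS NULL:

library(dplyr)

# per-column NULL counts, computed inside Spark via translated SQL
sdf %>%
  summarise(
    ci  = sum(as.integer(is.na(ci)),  na.rm = TRUE),
    ing = sum(as.integer(is.na(ing)), na.rm = TRUE),
    de  = sum(as.integer(is.na(de)),  na.rm = TRUE)
  ) %>%
  collect()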
0
spark_apply(
  df,
  function(e) sum(is.na(e)),
  names = c("your", "column", "names")
)

Try the above
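
Note that spark_apply hands each partition to the function as a plain data frame, so sum(is.na(e)) pools the NAs of all columns into one number (as the comment below points out). The difference is easy to see locally with the question's data; this is plain R, no Spark needed:

# `e` stands in for the data frame that a partition would arrive as
e <- data.frame(ci  = 21:26,
                ing = c(20, 19, NA, 100, NA, 50),
                de  = c(100, 0, 80, NA, 50, 30))

sum(is.na(e))                            # 3 -> one total over all columns
apply(e, 2, function(x) sum(is.na(x)))   # per column: ci = 0, ing = 2, de = 1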

TylerH
shashankp
  • returns 0 for class character and the sum of all NA's (over all cols) for class numeric --> that's not what's expected – nachti Mar 09 '18 at 11:48
0

Not perfect, but this works for your purpose using spark_apply:

## count missing values in each column, grouped by category
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

ci = c(21:26)
ing = c(20, 19, NA, 100, NA, 50)
de = c(100, 0, 80, NA, 50, 30)
df = as.data.frame(list(ci = ci, ing = ing, de = de))
sdf = copy_to(sc, df)

count_na_col_i = function(i, sdf) {
  cns = colnames(sdf)
  # prepare the data for spark_apply and rename columns as necessary
  cnt = spark_apply(
    sdf %>% select(cns[1], cns[i]) %>% mutate(x = cns[i]) %>% rename(y = cns[i]),
    f = function(tbl) {
      require(dplyr)
      cn = as.character(collect(tbl %>% select("x") %>% distinct()))
      tbl %>% filter(is.na(y)) %>% count()
    },
    columns = cns[i],
    group_by = cns[1]
  )
  collect(cnt)
}

# i-th column only
i = 2
nna = count_na_col_i(2, sdf)

# all columns
lapply(seq(2, length(colnames(sdf))), function(i, sdf) count_na_col_i(i, sdf), sdf)
Charlie
0

Using @Charlie's sdf object:

sdf %>% spark_apply(function(e) apply(e, 2, function(x) sum(is.na(x))))

will do the job.

The result is a Spark data frame with one column containing the number of NAs for each column of sdf, one value per row. If needed, you can transpose it (... %>% as.data.frame() %>% t()) and add the column names manually, as sketched after the output below.

# Source:   table<sparklyr_tmp_3f7f4665748e> [?? x 1]
# Database: spark_connection
     ci
  <int>
1     0
2     2
3     1
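
The transpose step mentioned above could look like this sketch (it assumes the collected values come back in the same order as colnames(sdf)):

na_counts <- sdf %>%
  spark_apply(function(e) apply(e, 2, function(x) sum(is.na(x)))) %>%
  collect() %>%
  as.data.frame() %>%
  t()
colnames(na_counts) <- colnames(sdf)
na_counts  # one row with the counts 0, 2 and 1 for ci, ing and de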
nachti