
Example sample data:

Si      K       Ca      Ba      Fe      Type
71.78   0.06    8.75    0       0       1
72.73   0.48    7.83    0       0       1
72.99   0.39    7.78    0       0       1
72.61   0.57    na      0       0       na
73.08   0.55    8.07    0       0       1
72.97   0.64    8.07    0       na      1
73.09   na      8.17    0       0       1
73.24   0.57    8.24    0       0       1
72.08   0.56    8.3     0       0       1
72.99   0.57    8.4     0       0.11    1
na      0.67    8.09    0       0.24    1

We can load the data into sparklyr with the following code:

sdf_copy_to(sc, sampledata)

I am looking for a query that returns, for each column, the number of NA values, for example:

si k ca fe
1  1  1 2
vijaynadal

1 Answer


This problem is actually a bit tricky due to the tbl_spark implementation and incompatibilities between Spark and R semantics. Even if you could apply colSums, Spark SQL doesn't allow implicit conversions between booleans and numerics. This means you have to apply as.numeric explicitly:

library(sparklyr)
library(dplyr)

# sc is an existing spark_connection, e.g. sc <- spark_connect(master = "local")
sampledata <- copy_to(sc, data.frame(x = c(1, NA, 2), y = c(NA, 2, NA), z = 42))

sampledata %>% 
  mutate_all(is.na) %>% 
  mutate_all(as.numeric) %>%
  summarize_all(sum)
# Source:   lazy query [?? x 3]
# Database: spark_connection
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     0
zero323
  • Thank you so much. One more question: I have to replace the null values with the mean of the column. I tried the following dplyr command but it shows an error: sampledata %>% mutate(k = ifelse(is.na(k), mean(k), k)) – vijaynadal Nov 29 '17 at 13:30
  • Take a look at https://stackoverflow.com/q/43614220/6910411 and https://stackoverflow.com/q/40057563/6910411. Other than that, it is a bit too complex to answer in a comment. – zero323 Nov 29 '17 at 13:35
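
As a follow-up to the mean-imputation question in the comments: sparklyr exposes Spark ML's Imputer through ft_imputer (Spark 2.2+), which replaces missing values with the column mean. A minimal sketch, assuming an existing spark_connection sc; the output column name k_imputed is my choice:

```r
library(sparklyr)
library(dplyr)

# sc is an existing spark_connection
sampledata <- copy_to(sc, data.frame(k = c(0.06, NA, 0.48)))

# Replace NAs in k with the column mean (Spark ML Imputer, strategy = "mean")
sampledata %>%
  ft_imputer(input_cols = "k", output_cols = "k_imputed", strategy = "mean")
```

The error in the commented mutate/ifelse attempt is likely because the aggregate mean(k) inside mutate requires a window specification in Spark SQL; ft_imputer sidesteps that by computing the statistics in a separate ML stage.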