
Suppose I have a data frame:

> df
         a    b  c    d  e
1  class 1   NA NA    M NA
2  class 2 0.60  3    F 12
3  class 3 0.40  4 <NA> 14
4  class 1   NA  5    F 67
5  class 1   NA NA <NA> 12
6  class 2 1.00 NA    F 22
7  class 1 0.45  6    M NA
8  class 1 1.20  7 <NA> NA
9  class 2   NA NA    M 34
10 class 2 1.30  1 <NA> 23
11 class 3 1.20  1    M 35
12 class 3 0.22 NA    F NA

I want to count, for each class in column a, how many values are missing. For example:

corresponding to class 1: 10 values are missing

corresponding to class 2: 4 values are missing

and so on. In my actual data I have one class variable and 35 predictors.

I used:

> complete.cases(df)

This works, but I want more detailed output in numbers, because the actual data I am working with is very large.
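For reference, complete.cases() returns one TRUE/FALSE per row, so counts can be obtained by summing its negation. A minimal sketch on a toy two-column frame (assumed data, not the question's full df):

```r
# Toy frame in the same shape as the question's data: one class column, one predictor
df <- data.frame(a = rep(c("class 1", "class 2"), each = 3),
                 b = c(NA, 2, NA, 4, NA, 6))

# complete.cases() gives a logical vector, one element per row
sum(!complete.cases(df))          # total number of incomplete rows: 3
table(df$a[!complete.cases(df)])  # incomplete rows broken down by class
```

Note this counts incomplete rows, not individual missing values; the answers below show how to count every NA per class.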

Please help me.

Thank You

Learner27

2 Answers


Part I, Your Original Data, Original Post:

How about negating the complete cases and then constructing a table from the output?

> (x <- df[!complete.cases(df),])
#         a  b
# 1 class 1 NA
# 4 class 1 NA
# 5 class 1 NA
# 9 class 2 NA
> table(x, useNA = "ifany")
#          b
# a         <NA>
#   class 1    3
#   class 2    1
#   class 3    0

Part II, Your Updated Data, Edited Post:

> cb <- cbind(df[1], isNA = rowSums(is.na(df[-1])))
> aggregate(isNA ~ a, cb, sum)
#         a isNA
# 1 class 1   10
# 2 class 2    4
# 3 class 3    3
Rich Scriven
  • Thanks Richard .. It worked but for a larger data its showing the following error : Error in table(x, useNA = "ifany") : attempt to make a table with >= 2^31 elements .... I have 36 variables in my data – Learner27 Sep 22 '14 at 20:56
  • @user3718501, please provide an example of what your desired result would look like for, let's say, a 5-column data set – David Arenburg Sep 22 '14 at 21:01
  • What are you talking about? I copied nothing. Your base R solution uses `tapply`. Are you suggesting I copied `rowSums`? Because I think it should be quite obvious to you by now that I know how to use `rowSums` as well – Rich Scriven Sep 22 '14 at 22:13
  • Except you use `data.frame` and `df[,-1]`. Those are not the same – Rich Scriven Sep 22 '14 at 22:15

One very fast solution (especially suited for big data sets) could be to use data.table:

library(data.table)
setDT(df)[, list(SumNAs = sum(is.na(.SD))), by = a]

#          a SumNAs
# 1: class 1     10
# 2: class 2      4
# 3: class 3      3

Or with base R:

df2 <- data.frame(a = df[, 1], freq = rowSums(is.na(df[, -1])))
with(df2, tapply(freq, a, sum))
## class 1 class 2 class 3 
##      10       4       3 

Edit: Here are some benchmarks, per the OP's comment about a big data set with many columns:

set.seed(123)
n <- 1e5
df <- data.frame(a = sample(c("class 1", "class 2", "class 3"), n, replace = TRUE),
                 b = sample(c(1:6, NA), n, replace = TRUE),
                 c = sample(c(1:6, NA), n, replace = TRUE),
                 d = sample(c(1:6, NA), n, replace = TRUE),
                 e = sample(c(1:6, NA), n, replace = TRUE),
                 f = sample(c(1:6, NA), n, replace = TRUE),
                 j = sample(c(1:6, NA), n, replace = TRUE),
                 h = sample(c(1:6, NA), n, replace = TRUE),
                 i = sample(c(1:6, NA), n, replace = TRUE),
                 k = sample(c(1:6, NA), n, replace = TRUE),
                 l = sample(c(1:6, NA), n, replace = TRUE),
                 m = sample(c(1:6, NA), n, replace = TRUE),
                 n = sample(c(1:6, NA), n, replace = TRUE))
library(microbenchmark)
df2 <- copy(df)

davidDT <- function(x) setDT(x)[, list(SumNAs = sum(is.na(.SD))), by = a]

davidBaseR <- function(x){
  df2 <- data.frame(a = x[, 1], freq = rowSums(is.na(x[, -1])))
  with(df2, tapply(freq, a, sum)) 
}

RichardBaseR <- function(x){
  cb <- cbind(x[1], isNA = rowSums(is.na(x[-1])))
  aggregate(isNA ~ a, cb, sum)
}

microbenchmark(davidDT(df2), 
               davidBaseR(df),
               RichardBaseR(df),
               times = 100L)

# Unit: milliseconds
#             expr        min         lq     median         uq       max neval
#     davidDT(df2)   34.25858   36.91607   39.19706   41.18780  113.0531   100
#   davidBaseR(df)   32.75058   36.46721   43.01609   47.66303  199.7966   100
# RichardBaseR(df) 1429.29449 1469.32023 1521.38640 1631.51353 2525.2406   100
David Arenburg
  • @David, Interesting benchmarks, but where's the "data.table" advantage here? :-) – A5C1D2H2I1M1N2O1R2T1 Sep 23 '14 at 03:42
  • (Actually, my benchmarks on my system show a better relative performance of "data.table" than is evident from your benchmark results--about twice as fast as `davidBaseR`, so I'm really curious.) – A5C1D2H2I1M1N2O1R2T1 Sep 23 '14 at 03:50
  • @AnandaMahto, you can replace my benchmarks with yours if you want. My point is that both of my solutions are hundreds of times more efficient (and original). So I have no idea what's going on with the vote count on this answer – David Arenburg Sep 23 '14 at 05:42
  • You got my vote (if it wasn't already evident by the timing of my comments and the upvote on this answer) :-). No idea why there's a -1 here though. Another thing that surprises me is that skipping the `data.frame` step in `davidBaseR` doesn't make that much of a difference in terms of benchmarking time (in other words, just doing `tapply(rowSums(is.na(df[-1])), df[[1]], sum)`. – A5C1D2H2I1M1N2O1R2T1 Sep 23 '14 at 05:50
  • @AnandaMahto, this is obvious. That's because R doesn't create a copy when you do `df2 <- df1` if `df2` didn't previously exist; see my comments on [this question](http://stackoverflow.com/questions/25379761/can-r-do-operations-like-cumsum-in-place/), for example – David Arenburg Sep 23 '14 at 05:54
  • What's going on with the vote counts might be attributed to the fact that some people think that speed is not everything. But we all know voting is pretty inconsistent across the board on SO. (+1) David, because this answer is good and to prove I'm not the downvoter. :-) They're just meaningless internet points anyway... – Rich Scriven Sep 23 '14 at 16:24
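The copy-on-modify behaviour discussed in the comments above can be seen directly in base R. A quick sketch (tracemem() reports duplications on standard CRAN builds of R, which have memory profiling enabled):

```r
x <- data.frame(a = 1:3)
tracemem(x)     # start reporting whenever x is duplicated
y <- x          # no copy made yet: both names point to the same object
y$a[1] <- 99L   # first modification of y triggers the duplication
x$a[1]          # x is untouched: still 1
```

This is why the extra `data.frame()` step in davidBaseR costs little: the assignment itself copies nothing, and a real duplication happens only on modification.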