Identifying rows in which all elements are missing values (R)

Question

I have a data frame (df) with missing values:

df:

head1   head2   head3
-----   -----   -----
34      32      6
NA      NA      NA
45      NA      11
54      15      98
45      56      NA
3       1       78
NA      5       NA

I want to add a column (head4) like this:

head1   head2   head3  head4
-----   -----   -----  -----
34      32      6      0
NA      NA      NA     1
45      NA      11     0
54      15      98     0
45      56      NA     0
3       1       78     0
NA      5       NA     0

That is, if all elements of a row are missing (NA), the new column should be 1 for that row, and 0 otherwise. How can I do that in R? I would be very glad for any help. Thanks a lot.

RHertel · Accepted Answer · 2016-03-28T09:06:04.960

You could try

df$head4 <- +(rowSums(is.na(df))==ncol(df))
#  head1 head2 head3 head4
#1    34    32     6     0
#2    NA    NA    NA     1
#3    45    NA    11     0
#4    54    15    98     0
#5    45    56    NA     0
#6     3     1    78     0
#7    NA     5    NA     0

Here rowSums() counts the NA values in each row. If all entries in a row are NA, this count equals the number of columns of the data.frame, and the comparison with ==ncol(df) returns TRUE; otherwise it returns FALSE. The logical vector is coerced to numeric values (0/1) by the + sign in front, which in this case is shorthand for an explicit conversion such as as.integer().
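To make the individual steps visible, the one-liner can be split up as in the sketch below. The data frame is rebuilt from the question; the intermediate names (na_mask, na_count, all_na) are only illustrative and not part of the original answer.

```r
# Rebuild the example data frame from the question.
df <- data.frame(
  head1 = c(34, NA, 45, 54, 45, 3, NA),
  head2 = c(32, NA, NA, 15, 56, 1, 5),
  head3 = c(6, NA, 11, 98, NA, 78, NA)
)

na_mask  <- is.na(df)              # logical matrix, TRUE where a value is missing
na_count <- rowSums(na_mask)       # number of NAs in each row
all_na   <- na_count == ncol(df)   # TRUE only where every column is NA
head4    <- +all_na                # unary plus turns TRUE/FALSE into 1/0

na_count
head4
```

Note that the comparison uses ncol(df) before head4 is attached, so the new column does not count toward the number of columns.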

Hope this helps.

Since @RichardTelford commented on the speed of the different answers, I tried to verify his claim that one of the other answers is about twice as fast as this one.

m <- matrix(runif(1e6),ncol=4)
nas <- sample(1e6,0.3*1.e6)
m[nas] <- NA
df <- as.data.frame(m)
library(microbenchmark)
frowsums <- function(x) {+(rowSums(is.na(x))==ncol(x))}
flapply <- function(x) {Reduce(`&`, lapply(x, is.na)) + 0L}
frowmeans <- function(x) {1*(rowMeans(is.na(x)) == 1)}
res <- microbenchmark(
  frowsums(df),
  flapply(df),
  frowmeans(df), times=1000L)
res  
Unit: milliseconds

          expr      min       lq     mean   median       uq      max neval cld
  frowsums(df) 15.75257 16.63475 20.23377 17.14405 17.82396 80.63485  1000   b
   flapply(df) 15.16721 15.23180 18.19778 16.13413 16.60948 88.92303  1000  a 
 frowmeans(df) 16.61643 17.56909 20.69433 18.03498 18.83867 81.54057  1000   b

As the results show, @RichardTelford's claim is not confirmed by this benchmark. There is hardly any difference in speed between the three solutions, which means that the simplest and most easily understood version should be preferred from a programmer's perspective.
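A speed comparison is only meaningful if the candidates return the same result. A quick check on the small data frame from the question might look like this; it is a sketch and not part of the original benchmark, and the result names (r_sums, r_lapply, r_means) are made up for illustration.

```r
# The three candidate functions, as defined in the benchmark above.
frowsums  <- function(x) +(rowSums(is.na(x)) == ncol(x))
flapply   <- function(x) Reduce(`&`, lapply(x, is.na)) + 0L
frowmeans <- function(x) 1 * (rowMeans(is.na(x)) == 1)

# Example data frame from the question.
df <- data.frame(
  head1 = c(34, NA, 45, 54, 45, 3, NA),
  head2 = c(32, NA, NA, 15, 56, 1, 5),
  head3 = c(6, NA, 11, 98, NA, 78, NA)
)

r_sums   <- frowsums(df)
r_lapply <- flapply(df)
r_means  <- frowmeans(df)

# Compare values only; the storage types differ (integer vs. double).
all(r_sums == r_lapply)
all(r_lapply == r_means)
```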

David Arenburg · Answer 2 · score 4 · 2016-03-27T16:32:24.313

I would suggest a combination of Reduce and lapply, which avoids converting the data frame to a matrix and copying the whole object in memory at once.

Reduce(`&`, lapply(df, is.na)) + 0L
# [1] 0 1 0 0 0 0 0
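For readers unfamiliar with Reduce, the sketch below unrolls what the call does for the three columns of the question's data frame. The names na_cols, step1 and step2 are only illustrative.

```r
# Example data frame from the question.
df <- data.frame(
  head1 = c(34, NA, 45, 54, 45, 3, NA),
  head2 = c(32, NA, NA, 15, 56, 1, 5),
  head3 = c(6, NA, 11, 98, NA, 78, NA)
)

na_cols <- lapply(df, is.na)             # list of logical vectors, one per column
step1   <- na_cols[[1]] & na_cols[[2]]   # NA in both head1 and head2
step2   <- step1 & na_cols[[3]]          # ... and also in head3

# Reduce applies `&` pairwise from left to right, exactly as unrolled above.
identical(step2, Reduce(`&`, na_cols))

step2 + 0L                               # adding 0L converts logical to integer
```

Each column is processed as a plain vector, so no intermediate logical matrix of the whole data frame is built.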

microbenchmark shows this solution to be about twice as fast as the other two. The rowMeans solution is about 20% faster than the rowSums solution. – Richard Telford Mar 27 '16 at 17:39

score 3 · Answer 3 · answered Mar 27 '16 at 15:58

You can find NAs with is.na() and then test whether all the elements in a row are NA (that is, whether the row mean of the logical matrix is 1) with the help of rowMeans():

df$head4 <- 1*(rowMeans(is.na(df)) == 1)

Multiplying by 1 coerces the logical vector to a numeric vector (you probably don't need to do this).
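As an illustration of why the coercion is often unnecessary, the logical vector can be used directly for counting, locating, or dropping the all-NA rows. This is a sketch on the question's data; all_na and df_clean are illustrative names.

```r
# Example data frame from the question.
df <- data.frame(
  head1 = c(34, NA, 45, 54, 45, 3, NA),
  head2 = c(32, NA, NA, 15, 56, 1, 5),
  head3 = c(6, NA, 11, 98, NA, 78, NA)
)

all_na <- rowMeans(is.na(df)) == 1   # logical vector, no coercion applied

sum(all_na)                # number of rows that are entirely NA
which(all_na)              # positions of those rows
df_clean <- df[!all_na, ]  # data frame without the all-NA rows
```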

Richard Telford