
I have a data frame with 2500 rows. A few of the rows contain an excessive number of NAs, and I want to remove those rows.

I've searched the SO archives, and come up with this as the most likely solution:

df2 <- df[df[, 12] != NA,]

But when I run it and look at df2, all I see is a screen full of NAs (and <NA>s).

Any suggestions?

rwjones
  • `df[ ! is.na( df[, 12] ) , ]` – Simon O'Hanlon Nov 13 '13 at 00:41
  • Define how many `NA`s per row constitute "an excessive number", then use `rowSums` and `is.na`. – A5C1D2H2I1M1N2O1R2T1 Nov 13 '13 at 01:33
  • If an "excessive number" is any .. then `?complete.cases`. Otherwise what Ananda said. – IRTFM Nov 13 '13 at 03:11
  • Right, I should have been more specific about "an excessive number." Actually, based on what I had, I wanted to delete any row with an NA anywhere. I ended up using Simon's method, and it worked. But I need to figure out -- and I will -- how to make it more general. Thanks. – rwjones Nov 13 '13 at 15:07

1 Answer


Depending on what you're looking for, one of the following should help you on your way:

Some sample data to start with:

mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4), 
                   C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4), 
                   E = c(NA, 2, 3, 4))
mydf
#    A  B  C  D  E
# 1  1  1  1 NA NA
# 2  2 NA NA  2  2
# 3 NA  3  3  3  3
# 4  4  4  4  4  4

If you want to remove rows based on only a few specific columns, you can use complete.cases or the solution suggested by @SimonO101 in the comments. Here, I'm removing rows that have an NA in the first column.

mydf[complete.cases(mydf$A), ]
#   A  B  C  D  E
# 1 1  1  1 NA NA
# 2 2 NA NA  2  2
# 4 4  4  4  4  4
mydf[!is.na(mydf[, 1]), ]
#   A  B  C  D  E
# 1 1  1  1 NA NA
# 2 2 NA NA  2  2
# 4 4  4  4  4  4
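
The example above checks a single column. If you need to check several columns at once, a minimal sketch could look like the following. The choice of columns A and B, and the names check_cols and res_cols, are just assumptions for illustration.

```r
# Same sample data as above
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
                   C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
                   E = c(NA, 2, 3, 4))

# Columns that must be non-missing (an arbitrary choice for illustration)
check_cols <- c("A", "B")

# complete.cases() accepts a data frame and returns TRUE for each row
# that has no NA in any of the supplied columns
res_cols <- mydf[complete.cases(mydf[, check_cols]), ]
res_cols   # keeps rows 1 and 4; NAs in D and E are ignored
```

Row 1 survives even though D and E are missing there, because only A and B are inspected.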

If, instead, you want to set a threshold, as in "keep only the rows that have fewer than 2 NA values" (without caring which columns the NA values are in), you can try something like this:

mydf[rowSums(is.na(mydf)) < 2, ]
#    A B C D E
# 3 NA 3 3 3 3
# 4  4 4 4 4 4
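
To make the threshold approach more general, you could wrap it in a small function. This is only a sketch: keep_rows and its arguments max_na and cols are hypothetical names I made up, not part of base R.

```r
# Same sample data as above
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
                   C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
                   E = c(NA, 2, 3, 4))

# Hypothetical helper: keep rows with at most `max_na` missing values
# among the columns named in `cols` (all columns by default)
keep_rows <- function(df, max_na = 0, cols = names(df)) {
  # drop = FALSE keeps a data frame even when `cols` names one column
  na_count <- rowSums(is.na(df[, cols, drop = FALSE]))
  df[na_count <= max_na, , drop = FALSE]
}

keep_rows(mydf, max_na = 1)          # same rows as rowSums(is.na(mydf)) < 2
keep_rows(mydf)                      # rows with no NA at all
keep_rows(mydf, cols = c("A", "B"))  # only columns A and B are checked
```

If you would rather think in proportions than counts, replacing rowSums with rowMeans and comparing against a fraction should work the same way.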

At the other extreme, if you want to delete all rows that have any NA values, just use complete.cases:

mydf[complete.cases(mydf), ]
#   A B C D E
# 4 4 4 4 4 4
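
As for why the attempt in the question returned nothing but NAs: any comparison with NA evaluates to NA, and an NA in a logical row index produces a row filled with NA. A short sketch using the sample data (the names cmp, bad, and good are just for illustration):

```r
# Same sample data as above
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
                   C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
                   E = c(NA, 2, 3, 4))

# Comparing anything with NA gives NA, never TRUE or FALSE
cmp <- mydf[, 1] != NA
cmp

# Each NA in the row index yields a row of NAs
bad <- mydf[cmp, ]

# is.na() returns a proper TRUE/FALSE vector, so the subset works
good <- mydf[!is.na(mydf[, 1]), ]
good
```

That is why the comments and the examples above test for missing values with is.na or complete.cases rather than with == or !=.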
A5C1D2H2I1M1N2O1R2T1