Discrepancy between complete.cases() and na.omit()

Question

I was messing around with the Auto dataset in R.

If I run the following:

auto = read.csv("Auto.csv", header=TRUE, na.strings="?")
summary(complete.cases(auto))

I get the following:

   Mode   FALSE    TRUE    NA's 
logical       5     392       0

However, when I run this, I get different results:

auto1 = na.omit(auto)
dim(auto)  # returns [1] 397   9
dim(auto1) # returns [1] 392   9

Why does complete.cases() tell me I have no NA's but na.omit() seems to be removing some entries?

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

The difference is that complete.cases returns a logical vector with one element per row of the dataset, while na.omit removes the rows that have at least one NA. Using the reproducible example defined under "data" at the end of this answer,

complete.cases(auto)
#[1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE

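As we can see, it is a logical vector with no NAs: it gives TRUE for rows that don't have any NAs and FALSE for rows that do. So the NA's column in summary() describes the logical vector itself, not the dataset, and is 0 here; the FALSE count is the number of rows with missing values.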

summary(complete.cases(auto))
#  Mode   FALSE    TRUE    NA's 
#logical       4       6       0
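As a minimal sketch of the point above, the following uses a small hypothetical data frame (df is made up for illustration and is not the Auto data) to show that the vector returned by complete.cases never contains NA, and that the number of incomplete rows is the count of FALSE values:

```r
# Hypothetical toy data: rows 2 and 4 each contain one NA
df <- data.frame(v1 = c(1, NA, 3, 4),
                 v2 = c("a", "b", "c", NA))

cc <- complete.cases(df)  # logical vector, one element per row

anyNA(cc)                 # FALSE: the vector itself has no missing values
sum(!cc)                  # number of rows with at least one NA
colSums(is.na(df))        # number of NAs in each column
```

Counting with sum(!complete.cases(...)) answers "how many rows would na.omit drop", which is what the question was really after.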

To get the same result as na.omit, use the logical vector to subset the original dataset:

autoN <- auto[complete.cases(auto),]
auto1 <- na.omit(auto)
dim(autoN)
#[1] 6 2
dim(auto1)
#[1] 6 2

Though the resulting rows are the same, na.omit also attaches an na.action attribute recording which rows were dropped

str(autoN)
#'data.frame':   6 obs. of  2 variables:
# $ v1: int  1 2 2 2 3 3
# $ v2: int  3 3 3 1 4 2
str(auto1)
#'data.frame':   6 obs. of  2 variables:
# $ v1: int  1 2 2 2 3 3
# $ v2: int  3 3 3 1 4 2
# - attr(*, "na.action")=Class 'omit'  Named int [1:4] 2 7 8 10
#  .. ..- attr(*, "names")= chr [1:4] "2" "7" "8" "10"

and is slower than subsetting with complete.cases in the benchmarks shown below.
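The extra attribute can be inspected directly. This is a small sketch on a hypothetical data frame (df, kept_sub and kept_omit are illustrative names, not part of the original example) showing that the two approaches keep the same rows and differ only in the na.action attribute:

```r
# Hypothetical toy data: rows 1 and 2 each contain one NA
df <- data.frame(v1 = c(1, NA, 3, 4),
                 v2 = c(NA, 2, 3, 4))

kept_sub  <- df[complete.cases(df), ]  # subset with the logical vector
kept_omit <- na.omit(df)               # same rows, plus an attribute

# Indices of the dropped rows, stored with class "omit"
attr(kept_omit, "na.action")

# Compare the data only, ignoring the extra attribute
all.equal(kept_sub, kept_omit, check.attributes = FALSE)
```

If the record of dropped rows is not needed, setting attr(kept_omit, "na.action") <- NULL removes it.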

Benchmarks

set.seed(238)
df1 <- data.frame(v1 = sample(c(NA, 1:9), 1e7, replace=TRUE),
              v2 = sample(c(NA, 1:50), 1e7, replace=TRUE))
system.time(na.omit(df1))
#  user  system elapsed 
#   2.50    0.19    2.69 
system.time(df1[complete.cases(df1),])
#  user  system elapsed 
#  0.61    0.09    0.70

data

set.seed(24)
auto <- data.frame(v1 = sample(c(NA, 1:3), 10, replace=TRUE), 
                   v2 = sample(c(NA, 1:4), 10, replace=TRUE))

When, then, would I use complete.cases over na.omit and vice versa? - Kevin Zakka, May 31 '16 at 05:13
@kevinzakka `na.omit` also returns some attributes and can be slower than `complete.cases`. Other than that, both give the same result as long as we don't forget to subset the dataset with `complete.cases`. Please check my update. - akrun, May 31 '16 at 05:18
