The difference is that complete.cases
returns a logical vector with one element per row of the dataset, while na.omit
removes every row that has at least one NA. Using the reproducible example created below,
complete.cases(auto)
#[1] TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
As we can see, it is a logical vector with no NAs. It gives TRUE
for rows that don't have any NAs and FALSE otherwise. So calling summary
on this logical vector reports no NA's.
summary(complete.cases(auto))
# Mode FALSE TRUE NA's
#logical 4 6 0
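Because the result is an ordinary logical vector, it can also be used to count or locate the incomplete rows. A minimal sketch, assuming a small hypothetical data frame `df` (not the `auto` data from this answer):

```r
# Hypothetical toy data, not the `auto` dataset from this answer
df <- data.frame(v1 = c(1L, NA, 2L, 3L, NA),
                 v2 = c(3L, 4L, NA, 1L, 2L))

ok <- complete.cases(df)   # TRUE where a row has no NA

sum(!ok)     # number of rows with at least one NA
which(!ok)   # positions of those rows
mean(ok)     # proportion of complete rows
```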
To get the same result as na.omit,
use the logical vector to subset the rows of the original dataset:
autoN <- auto[complete.cases(auto),]
auto1 <- na.omit(auto)
dim(autoN)
#[1] 6 2
dim(auto1)
#[1] 6 2
Though the resulting rows are the same, na.omit
also attaches an "na.action" attribute that records the removed rows
str(autoN)
#'data.frame': 6 obs. of 2 variables:
# $ v1: int 1 2 2 2 3 3
# $ v2: int 3 3 3 1 4 2
str(auto1)
#'data.frame': 6 obs. of 2 variables:
# $ v1: int 1 2 2 2 3 3
# $ v2: int 3 3 3 1 4 2
# - attr(*, "na.action")=Class 'omit' Named int [1:4] 2 7 8 10
# .. ..- attr(*, "names")= chr [1:4] "2" "7" "8" "10"
na.omit is also slower than subsetting with complete.cases
(about 2.7 s versus 0.7 s elapsed on 1e7 rows), based on the benchmarks shown below.
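The extra attribute can be inspected directly, and the two results can be compared while ignoring it. A rough sketch, again on a hypothetical toy data frame `df` rather than `auto`; the object names here are illustrative only:

```r
# Hypothetical toy data, not the `auto` dataset from this answer
df <- data.frame(v1 = c(1L, NA, 2L, 3L, NA),
                 v2 = c(3L, 4L, NA, 1L, 2L))

by_subset <- df[complete.cases(df), ]
by_omit   <- na.omit(df)

# Row indices that na.omit dropped, stored in the "na.action" attribute
dropped <- attr(by_omit, "na.action")
as.integer(dropped)

# The data are equal once attributes are ignored
all.equal(by_subset, by_omit, check.attributes = FALSE)

# Dropping the attribute leaves a plain data frame that can be
# compared with the complete.cases result
attr(by_omit, "na.action") <- NULL
identical(by_subset, by_omit)
```

Keeping the attribute can be useful when the positions of the removed rows are needed later; otherwise the complete.cases subset carries no such bookkeeping.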
Benchmarks
set.seed(238)
df1 <- data.frame(v1 = sample(c(NA, 1:9), 1e7, replace=TRUE),
v2 = sample(c(NA, 1:50), 1e7, replace=TRUE))
system.time(na.omit(df1))
# user system elapsed
# 2.50 0.19 2.69
system.time(df1[complete.cases(df1),])
# user system elapsed
# 0.61 0.09 0.70
data
set.seed(24)
auto <- data.frame(v1 = sample(c(NA, 1:3), 10, replace=TRUE),
v2 = sample(c(NA, 1:4), 10, replace=TRUE))