Using is.na in R to get Column Names that Contain NA Values

Question

Given the example data set below:

df <- as.data.frame(matrix( c(1, 2, 3, NA, 5, NA, 
                              7, NA, 9, 10, NA, NA), nrow=2, ncol=6))

names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

print(df)

  varA varB varC varD varE varF
1    1    3    5    7    9   NA
2    2   NA   NA   NA   10   NA

I'd like to be able to use kmeans(...) on data sets without having to manually check for, or delete, variables that contain NA anywhere. While I'm asking about kmeans(...) right now, I'll be using a similar process for other things, so a kmeans(...)-specific answer won't fully answer my question.

The manual version of what I'd like is:

kmeans_model <- kmeans(df[, -c(2:4, 6)], 10)

And the pseudo-code would be:

kmeans_model <- kmeans(df[, -c(colnames(is.na(df)))], 10)

Also, I don't want to delete the data from df. Thanks in advance.

(Obviously kmeans(...) wouldn't work on this example data set but I can't recreate the real data set)

A very nearly duplicate: http://stackoverflow.com/q/11330138/324364 — joran, Aug 07 '14 at 16:55
Probably `df[, which(!sapply(df, function(col) sum(is.na(col)) > 0))]` — lukeA, Aug 07 '14 at 16:58
@lukeA Using `any()` might be easier to read and slightly shorter. (Could be slightly slower, though, I'd have to check.) — joran, Aug 07 '14 at 17:05
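For reference, the any() variant suggested in the comments could look roughly like the sketch below. The names has_na and df_complete are made up for illustration, and the data is the example from the question.

```r
# Example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# TRUE for every column that contains at least one NA
# (has_na is an illustrative name, not from the original thread)
has_na <- sapply(df, function(col) any(is.na(col)))

# Keep only the columns without NA; df itself is left unchanged
df_complete <- df[, !has_na]
names(df_complete)
# "varA" "varE"
```

Because the result is assigned to a new object, df still holds all six columns afterwards.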

talat · Accepted Answer · 2014-08-07T17:07:47.450

Here are two options without sapply:

kmeans_model <- kmeans(df[, !colSums(is.na(df))], 10)

Or

kmeans_model <- kmeans(df[, colSums(is.na(df)) == 0], 10)

Explanation:

colSums(is.na(df)) counts the number of NAs per column, resulting in:

colSums(is.na(df))
#varA varB varC varD varE varF 
#   0    1    1    1    0    2

And then

colSums(is.na(df)) == 0     # converts to logical TRUE/FALSE
#varA  varB  varC  varD  varE  varF 
#TRUE FALSE FALSE FALSE  TRUE FALSE

is the same as

!colSums(is.na(df))
#varA  varB  varC  varD  varE  varF 
#TRUE FALSE FALSE FALSE  TRUE FALSE

Both methods can be used to keep only those columns where the logical value is TRUE, that is, the columns that contain no NA.
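Since the title asks for the column names themselves, here is a minimal sketch that builds on the same colSums(is.na(df)) vector. The names na_counts, na_cols and df_complete are illustrative. Adding drop = FALSE is an extra precaution, not part of the answer above: it keeps the result a data frame even if only one column survives.

```r
# Example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# Number of NAs per column, as a named numeric vector
na_counts <- colSums(is.na(df))

# Names of the columns that contain at least one NA
na_cols <- names(na_counts)[na_counts > 0]
na_cols
# "varB" "varC" "varD" "varF"

# Columns without NA; drop = FALSE guards against the result
# collapsing to a plain vector when a single column remains
df_complete <- df[, na_counts == 0, drop = FALSE]
names(df_complete)
# "varA" "varE"
```

df_complete can then be passed to kmeans(...) or any other function, while df keeps every column.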

Saurabh Jain · Answer 2 · 2017-11-30T10:29:23.150

This is the generic approach that I use for listing column names and their count of NAs:

sort(colSums(is.na(df)), decreasing = TRUE)

If you want to use sapply, you can refer to this code snippet as well (flights is the data frame being inspected):

flights_NA_cols <- sapply(flights, function(x) sum(is.na(x))) 
flights_NA_cols[flights_NA_cols>0]
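Applied to the df from the question instead of flights, the same idea might look like the sketch below. It uses vapply rather than sapply so the return type is fixed; that is a stylistic choice, not something the answer requires. The names na_counts and na_only are illustrative.

```r
# Example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# Count the NAs in each column; vapply guarantees one integer per column
na_counts <- vapply(df, function(x) sum(is.na(x)), integer(1))

# Keep only the columns that have NAs, with the largest count first
na_only <- sort(na_counts[na_counts > 0], decreasing = TRUE)
na_only
# varF comes first with 2 NAs; varB, varC and varD have 1 each
```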

edited Nov 30 '17 at 10:29

answered Nov 30 '17 at 10:19
