
I have a data frame, books, and I'm trying to loop through all its columns and print something like "Missing" if a column has any missing values.

Below is my code. is.na() returns which elements are missing; I then check whether TRUE appears among those elements, which indicates the column has a missing value.

This works.

However, being new to R, I suspect there are better ways of doing this that I'm unaware of.

for (col in colnames(books)) {
  bool <- is.na(books[[col]])
  if (TRUE %in% bool) {
    print("Missing")
  } else {
    print("Fine")
  }
}
  • If it is just to identify columns that have NA, use `colSums(is.na(books)) > 0`; or if you want to select columns that have no NA, `books %>% select(where(~ all(!is.na(.))))` – akrun Mar 10 '21 at 21:00
  • @akrun thank you! The `colSums()` code was exactly what I was after –  Mar 10 '21 at 21:04
  • Related: [Find names of columns which contain missing values](https://stackoverflow.com/questions/20364450/find-names-of-columns-which-contain-missing-values). Just skip the additional step of finding the names. – Henrik Mar 10 '21 at 21:17

5 Answers


The anyNA function is built for this. You can apply it to all columns of a data frame with sapply(books, anyNA). To count NA values, akrun's suggestion of colSums(is.na(books)) is good.
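For example, on a small made-up data frame (the data here is purely illustrative):

books <- data.frame(title = c("A", "B", NA),
                    price = c(9.99, 12.00, 4.50))

sapply(books, anyNA)   # one logical per column
#> title price
#>  TRUE FALSE

colSums(is.na(books))  # NA counts per column
#> title price
#>     1     0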

Gregor Thomas
  • Perfect, thank you. I've been playing around with the `sapply` function using `range` and `mean`, but I had no idea there was an `anyNA` function. –  Mar 10 '21 at 21:07
  • Just out of interest, can `sapply` be run on a single column by specifying the name of the column? For example, this doesn't work as I might expect: `sapply(books$price, mean)`, but this does: `sapply(books[,4], mean)`, assuming `price` is the 4th column. –  Mar 10 '21 at 21:15
  • `sapply` iterates - it applies a function to each column. You could iterate over a single column, but you need to give it a 1-column data frame, not just a vector. `books['price']` or `books[, 'price', drop = FALSE]` will give 1-column data frames, whereas `books[['price']]`, `books[, 'price']` or `books$price` extract the column as a plain vector, not a data frame.... – Gregor Thomas Mar 10 '21 at 21:26
  • So you could use `sapply(books['price'], anyNA)`, but there's not really any point in iterating over a single column; you'd do better to call the function directly on the extracted column: `anyNA(books$price)`. – Gregor Thomas Mar 10 '21 at 21:27
  • Ah okay I see how it works now. Yes `anyNA(books$price)` would make a lot more sense now that I think of it. Thanks for the help. –  Mar 10 '21 at 21:30

Using colSums on a logical matrix counts the number of TRUE values per column (TRUE -> 1 and FALSE -> 0). From there, create a logical vector with the comparison operator (>):

colSums(is.na(books)) > 0 
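A minimal sketch of the intermediate steps, on a made-up two-column data frame:

df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))
is.na(df)                # logical matrix; TRUE marks each NA
colSums(is.na(df))       # per-column counts of TRUE: x = 1, y = 0
colSums(is.na(df)) > 0   # logical vector: x = TRUE, y = FALSE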
akrun

The colSums answer by @akrun is super efficient. Here is another implementation for your purpose:

seq(ncol(books)) %in% unique(which(is.na(books), arr.ind = TRUE)[, "col"])
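Breaking that one-liner apart on a toy data frame (a sketch, not from the original answer):

df <- data.frame(x = c(1, NA), y = c("a", "b"))
which(is.na(df), arr.ind = TRUE)                   # (row, col) positions of the NAs
#>      row col
#> [1,]   2   1
unique(which(is.na(df), arr.ind = TRUE)[, "col"])  # columns containing any NA: 1
seq(ncol(df)) %in% 1                               # TRUE FALSE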
ThomasIsCoding

The following code helped me a lot.

This function shows what percentage of the values in each column of your df are missing:

p <- function(x) {sum(is.na(x)) / length(x) * 100}  # percent of values missing
apply(df, 2, p)                                     # compute it for each column

Here the steps are: 1. find the rows with missing values; 2. store their indices in a vector; 3. drop those rows from the df.

which(!complete.cases(df))           # indices of rows containing at least one NA
na_df <- which(!complete.cases(df))  # store those row indices
df1 <- df[-na_df, ]                  # drop the incomplete rows

In the last line, I create a new data frame "df1" containing only the complete rows.
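One edge case to be aware of: if df has no incomplete rows, na_df is integer(0), and df[-na_df, ] then keeps zero rows instead of all of them. A base-R subsetting form that avoids this (a standard idiom, not part of the code above):

df1 <- df[complete.cases(df), ]  # safe even when nothing is missing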

All the best

LucaCoding
  • Don't use `apply` on data frame columns - it's inefficient because it converts the data frame to a matrix as a first step. `sapply(df, p)` or `lapply(df, p)` is better than `apply(df, 2, p)`. – Gregor Thomas Mar 10 '21 at 21:29

Another way to find them, with the dplyr library, is:

library(dplyr)

mtcars %>%
  select(everything()) %>%          # replace with the columns you need
  summarise_all(~ sum(is.na(.)))    # count the NAs in every selected column
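On mtcars this returns 0 for every column, since mtcars has no missing values. In current dplyr, summarise_all() is superseded; an equivalent form with across() (a sketch, assuming dplyr >= 1.0) is:

mtcars %>%
  summarise(across(everything(), ~ sum(is.na(.x))))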