
I have a data frame, books, and I'm trying to loop through all its columns and print something like "Missing" if a column has any missing values.

Below is my code. is.na() returns which elements are missing; I then check whether TRUE appears among those elements, which indicates the column has a missing value.

This works.

However, being new to R, I suspect there are better ways of doing this that I'm unaware of.

for (col in colnames(books)) {
  bool <- is.na(books[[col]])
  if (TRUE %in% bool) {
    print("Missing")
  } else {
    print("Fine")
  }
}
  • If it is just to identify columns that have NA, use `colSums(is.na(books)) > 0`; or if you want to select columns that have no NA, `books %>% select(where(~ all(!is.na(.))))` – akrun Mar 10 '21 at 21:00
  • @akrun thank you! The `colSums()` code was exactly what I was after –  Mar 10 '21 at 21:04
  • Related: [Find names of columns which contain missing values](https://stackoverflow.com/questions/20364450/find-names-of-columns-which-contain-missing-values). Just skip the additional step of finding the names. – Henrik Mar 10 '21 at 21:17

5 Answers


The anyNA function is built for this. You can apply it to all columns of a data frame with sapply(books, anyNA). To count NA values, akrun's suggestion of colSums(is.na(books)) is good.
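For example, on a small made-up data frame (the data here is purely illustrative):

books <- data.frame(title = c("A", "B", NA),
                    price = c(9.99, 12.00, 4.50))

sapply(books, anyNA)   # one logical per column
#> title price
#>  TRUE FALSE

colSums(is.na(books))  # NA counts per column
#> title price
#>     1     0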

Gregor Thomas
  • Perfect, thank you. I've been playing around with the `sapply` function using `range` and `mean`, but I had no idea there was an `anyNA` function. –  Mar 10 '21 at 21:07
  • Just out of interest, can `sapply` be run on a single column by specifying the name of the column? For example, this doesn't work as I might expect: `sapply(books$price, mean)`, but this does: `sapply(books[,4], mean)`, assuming `price` is the 4th column. –  Mar 10 '21 at 21:15
  • `sapply` iterates - it applies a function to each column. You could iterate over a single column, but you need to give it a 1-column data frame, not just a vector. `books['price']` or `books[, 'price', drop = FALSE]` will give 1-column data frames, whereas `books[['price']]`, `books[, 'price']` or `books$price` extract the column as a plain vector, not a data frame.... – Gregor Thomas Mar 10 '21 at 21:26
  • So you could use `sapply(books['price'], anyNA)`, but there's not really any point in iterating over a single column; you'd do better to call the function directly on the extracted column: `anyNA(books$price)`. – Gregor Thomas Mar 10 '21 at 21:27
  • Ah okay I see how it works now. Yes `anyNA(books$price)` would make a lot more sense now that I think of it. Thanks for the help. –  Mar 10 '21 at 21:30

Using colSums on a logical matrix counts the number of TRUE values per column (TRUE -> 1 and FALSE -> 0). From there, create a logical vector with the comparison operator (>):

colSums(is.na(books)) > 0 
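A minimal sketch of the intermediate steps, on a made-up two-column data frame:

df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))
is.na(df)                # logical matrix; TRUE marks each NA
colSums(is.na(df))       # per-column counts of TRUE: x = 1, y = 0
colSums(is.na(df)) > 0   # logical vector: x = TRUE, y = FALSE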
akrun

The colSums answer by @akrun is super efficient. Here is another implementation for your purpose:

seq(ncol(books)) %in% unique(which(is.na(books), arr.ind = TRUE)[, "col"])
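Breaking that one-liner apart on a toy data frame (a sketch, not from the original answer):

df <- data.frame(x = c(1, NA), y = c("a", "b"))
which(is.na(df), arr.ind = TRUE)                   # (row, col) positions of the NAs
#>      row col
#> [1,]   2   1
unique(which(is.na(df), arr.ind = TRUE)[, "col"])  # columns containing any NA: 1
seq(ncol(df)) %in% 1                               # TRUE FALSE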
ThomasIsCoding

The following code helped me a lot.

This function shows what percentage of the values in each column of your df are missing:

p <- function(x) {sum(is.na(x)) / length(x) * 100}  # percent of values missing
apply(df, 2, p)                                     # compute it for each column

Here the steps are: 1. find the rows with missing values; 2. store their indices in a vector; 3. drop those rows from the df.

which(!complete.cases(df))           # indices of rows containing at least one NA
na_df <- which(!complete.cases(df))  # store those row indices
df1 <- df[-na_df, ]                  # drop the incomplete rows

In the last line, I create a new data frame "df1" containing only the complete rows.
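One edge case to be aware of: if df has no incomplete rows, na_df is integer(0), and df[-na_df, ] then keeps zero rows instead of all of them. A base-R subsetting form that avoids this (a standard idiom, not part of the code above):

df1 <- df[complete.cases(df), ]  # safe even when nothing is missing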

All the best

LucaCoding
  • Don't use `apply` on data frame columns - it's inefficient because it converts the data frame to a matrix as a first step. `sapply(df, p)` or `lapply(df, p)` is better than `apply(df, 2, p)`. – Gregor Thomas Mar 10 '21 at 21:29

Another way to find them, with the dplyr library, is:

library(dplyr)

mtcars %>%
  select(everything()) %>%          # replace with the columns you need
  summarise_all(~ sum(is.na(.)))    # count the NAs in every selected column
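On mtcars this returns 0 for every column, since mtcars has no missing values. In current dplyr, summarise_all() is superseded; an equivalent form with across() (a sketch, assuming dplyr >= 1.0) is:

mtcars %>%
  summarise(across(everything(), ~ sum(is.na(.x))))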