2

I have a data frame with about 90 variables and over 1 million observations. I want to calculate the percentage of NA values in each variable. I have the following code: `sum(is.na(dataframe$variable)) / nrow(dataframe) * 100`. My question is: how can I apply this calculation to all 90 variables without having to type every variable name in the code?

Jaap
  • 2
    `lapply(df, yourfunction)` – Jaap Nov 05 '15 at 16:12
  • 1
    Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Nov 05 '15 at 16:13

2 Answers

3

Use lapply() with your method:

lapply(df, function(x) sum(is.na(x))/nrow(df)*100)
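As a minimal base R sketch of the same idea, assuming the goal is a named numeric vector rather than a list: `is.na()` applied to a whole data frame returns a logical matrix, so `colMeans()` gives the share of NA values per column. The sample data below mirrors the small data frame shown further down in this thread; the names `na_pct` and `na_pct_sapply` are only illustrative.

```r
# Sample data: 10 rows, with NAs placed in x and y
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA

# is.na(df) is a logical matrix; colMeans() gives the proportion of TRUE
# values per column, and * 100 converts it to a percentage
na_pct <- colMeans(is.na(df)) * 100
print(na_pct)  # x = 30, y = 20, z = 0

# Equivalent column-by-column version; sapply() simplifies the list to a vector
na_pct_sapply <- sapply(df, function(x) mean(is.na(x)) * 100)
```

On a data frame with a million rows, `is.na(df)` builds a full logical matrix in memory, so the `sapply()` form may be the lighter option of the two.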
maccruiskeen
3

If you want to return a data.frame rather than a list (via lapply()) or a vector (via sapply()), you can use summarise_each from the dplyr package:

library(dplyr)

df %>%
  summarise_each(funs(sum(is.na(.)) / length(.)))

or, even more concisely (both versions return proportions; multiply by 100 for percentages):

df %>% summarise_each(funs(mean(is.na(.)))) 

data

df <- data.frame(
  x = 1:10,
  y = 1:10,
  z = 1:10
)

df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA
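
In more recent dplyr releases, `summarise_each()` and `funs()` are deprecated in favour of `across()`. A rough equivalent, assuming dplyr 1.0.0 or later is installed, might look like the sketch below; it also multiplies by 100 so the result is a percentage, as the question asks. The name `na_pct` is only illustrative.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

# Same sample data as above
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA

# Apply the same summary to every column; the result is a one-row data frame
na_pct <- df %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100))

print(na_pct)  # one row: x = 30, y = 20, z = 0
```

To restrict the calculation to some columns, `everything()` can be swapped for another tidyselect helper such as `where(is.numeric)`.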
davechilders