I have a data frame with about 90 variables and over 1 million observations. I want to calculate the percentage of NA values in each variable. I have the following code: `sum(is.na(dataframe$variable)) / nrow(dataframe) * 100`. My question is: how can I apply this function to all 90 variables without having to type every variable name in the code?
- `lapply(df, yourfunction)` – Jaap Nov 05 '15 at 16:12
- Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Nov 05 '15 at 16:13
2 Answers
Use `lapply()` with your method:
lapply(df, function(x) sum(is.na(x))/nrow(df)*100)
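If a named vector is easier to read than a list, the same idea can be sketched with `sapply()` or `colMeans()`. This is a minimal sketch, not part of the original answer; the small `df` below is hypothetical sample data standing in for the real 90-column data frame.

```r
# Hypothetical sample data: x has 3 NAs, y has 2, z has none
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA

# sapply() simplifies the result to a named numeric vector.
# mean() of a logical vector is the share of TRUE values.
na_pct <- sapply(df, function(x) mean(is.na(x)) * 100)

# Equivalent vectorised form: is.na() on a data frame returns a
# logical matrix, and colMeans() averages each column of it.
na_pct_matrix <- colMeans(is.na(df)) * 100

print(na_pct)
```

Note that `is.na(df)` builds a full logical matrix in memory, so on a data frame with 1 million rows and 90 columns the column-by-column `sapply()` form is likely the lighter option.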

maccruiskeen
If you want to return a `data.frame` rather than a list (via `lapply()`) or a vector (via `sapply()`), you can use `summarise_each` from the `dplyr` package:
library(dplyr)
df %>%
summarise_each(funs(sum(is.na(.)) / length(.)))
or, even more concisely:
df %>% summarise_each(funs(mean(is.na(.))))
Data:
df <- data.frame(
  x = 1:10,
  y = 1:10,
  z = 1:10
)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA
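In more recent dplyr releases, `summarise_each()` and `funs()` are deprecated. Assuming dplyr 1.0.0 or later is installed, the same summary can be sketched with `across()`. This is an illustrative alternative rather than the original answer's code; it multiplies by 100 to give percentages, as the question asks, rather than proportions.

```r
library(dplyr)

# Same sample data as above: x has 3 NAs, y has 2, z has none
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA

# across(everything(), ...) applies the lambda to every column;
# the result is a one-row data frame of NA percentages.
na_pct_df <- df %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100))

print(na_pct_df)
```

The result keeps the original column names, so it can be reshaped or joined like any other data frame.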

davechilders