Percentage of NA Values (Data Frame and Variables) in R

Question

I would like to calculate the percentage of NA values in a data frame, both overall and for individual variables.

For the whole data frame I get:

mean(is.na(dataframe))
# 0.03354

How do I read this result? Is it 0.033% NA? I don't understand.

For the individual variables, I counted the NAs as follows:

sapply(DATAFRAME, function(x) sum(is.na(x)))

Then, for the percentage of NA-values:

colMeans(is.na(VARIABLEX))

This doesn't work; I get the following error:

"'x' must be an array of at least two dimensions"

Why does this error occur? Anyway, afterwards I tried the following:

mean(is.na(VariableX))
# 0.1188

Should I interpret this as having 0.11% NA-values?

Welcome to SO! Please read [ask] and https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and [mcve] ... then edit your question! — jogo, Oct 05 '17 at 11:53
How can we store the percentages of missings by column in a new data frame? — user19226726, Jun 06 '23 at 00:53

score 0 · Answer 1 · answered Oct 05 '17 at 11:51

I'd just divide the number of rows containing NAs by the total number of rows:

df <- data.frame(data = c(NA, NA, NA, NA, 2, 4, NA, 7, NA))

percent_NA <- NROW(df[is.na(df$data),])/NROW(df)

Which gives:

> percent_NA
[1] 0.6666667

This means that 66.67% of the values in my data frame are NA.
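As a minimal sketch of how to read such a result: `mean(is.na(...))` returns a proportion between 0 and 1, so it has to be multiplied by 100 to get a percentage. By that reading, the question's `0.03354` is about 3.35% and `0.1188` is about 11.88%. The name `toy` is hypothetical, chosen only to avoid reusing `df`.

```r
# Same toy data as above, under a hypothetical name
toy <- data.frame(data = c(NA, NA, NA, NA, 2, 4, NA, 7, NA))

prop_na <- mean(is.na(toy$data))   # proportion of NA values, between 0 and 1
pct_na  <- 100 * prop_na           # the same quantity expressed as a percentage

round(pct_na, 2)
# [1] 66.67
```

For a single column this gives the same number as dividing the count of NA rows by the total number of rows.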

Rui Barradas · Answer 2 · 2017-10-05T12:43:15.487

0

I don't understand the issue you are trying to solve. It all works as expected.
First, a dataset since you haven't provided one.

set.seed(6180)  # make it reproducible
dat <- data.frame(x = sample(c(1:4, NA), 100, TRUE),
                  y = sample(c(1:5, NA), 100, TRUE))

Now the code for sums.

s <- sapply(dat, function(x) sum(is.na(x)))
s
# x  y 
#18 13
sum(s)
#[1] 31
sum(is.na(dat))
#[1] 31

colSums(is.na(dat))
# x  y 
#18 13
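The error in the question most likely comes from the fact that `colSums` and `colMeans` expect a two-dimensional object (a matrix or data frame), while `is.na()` applied to a single variable returns a plain vector. A small sketch, using a hypothetical vector `v`:

```r
v <- c(1, NA, 3, NA)        # a single variable (hypothetical example vector)

is.null(dim(is.na(v)))      # TRUE: is.na() on a vector has no dimensions
mean(is.na(v))              # works on a vector
# [1] 0.5

# Wrapping the vector in a one-column data frame gives colMeans two dimensions
colMeans(is.na(data.frame(v = v)))
#   v 
# 0.5

# The direct call fails; tryCatch() captures the error instead of stopping
res <- tryCatch(colMeans(is.na(v)), error = function(e) "error")
res
# [1] "error"
```

So for one variable, `mean(is.na(x))` is the right call; `colMeans` is for the whole data frame.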

The same goes for means, be it mean or colMeans.
EDIT.
Here is the code to get the proportion of NA values per column/variable and for the data frame as a whole.

sapply(dat, function(x) mean(is.na(x)))
#   x    y 
#0.18 0.13
colMeans(is.na(dat))   # Same result, faster
#   x    y 
#0.18 0.13
mean(is.na(dat))       # overall mean
#[1] 0.155
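One possible way to turn these proportions into percentages and keep them in a new data frame, sketched on a small hand-made example so the output is easy to check. The names `dat_small` and `na_summary` are assumptions for illustration, not part of the code above.

```r
# Small deterministic example (hypothetical data)
dat_small <- data.frame(x = c(1, NA, 3, NA), y = c(NA, 2, 3, 4))

# One row per variable: NA count and NA percentage
na_summary <- data.frame(
  variable   = names(dat_small),
  n_na       = colSums(is.na(dat_small)),
  percent_na = 100 * colMeans(is.na(dat_small)),
  row.names  = NULL
)
na_summary
#   variable n_na percent_na
# 1        x    2         50
# 2        y    1         25

100 * mean(is.na(dat_small))   # overall percentage: 3 NAs out of 8 cells
# [1] 37.5
```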

edited Oct 05 '17 at 12:43

answered Oct 05 '17 at 11:59


I would like the percentage of NA values for the data frame and for each variable. My data frame has 44750 rows and 7 columns. For the percentage of each of the 7 variables I used `percentvar <- nrow(df[is.na(df$variable), ]) / NROW(df)`, once per variable. For the percentage over the whole data frame I used `sum(is.na(df)) / prod(dim(df))`. Does that look correct to you? – jessica scucchia Oct 05 '17 at 12:24
@jessicascucchia OK, I will edit my answer. There are simple ways of doing what you want. Note that in the code above you really don't need `sapply`; `colSums` and `colMeans` do it for you and are more efficient. – Rui Barradas Oct 05 '17 at 12:40
@jessicascucchia And yes, `sum(is.na(df))/prod(dim(df))` does give the same result as my last line of code. But mine is simpler. **Note:** don't name your data frame `df`, since `df` is already an R function (the F distribution density in the `stats` package). – Rui Barradas Oct 05 '17 at 12:46
