I would like to calculate the percentage of NA values in a data frame as a whole and for each of its variables.

For the whole data frame I get this:

mean(is.na(dataframe))
# 0.03354

How do I read this result? Does it mean 0.033% NA values? I don't understand.

For the individual variables, I counted the NAs as follows:

sapply(DATAFRAME, function(x) sum(is.na(x)))

Then, for the percentage of NA-values:

colMeans(is.na(VARIABLEX)) 

This doesn't work; I get the following error:

"'x' must be an array of at least two dimensions"

Why does this error occur? Anyway, afterwards I tried the following:

mean(is.na(VariableX))
# 0.1188

Should I interpret this as having 0.11% NA-values?

KenHBS
  • Welcome to SO! Please read [ask] and https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and [mcve] ... then edit your question! – jogo Oct 05 '17 at 11:53
  • How can we store the percentages of missings by column in a new data frame? – user19226726 Jun 06 '23 at 00:53

2 Answers

I'd just divide the number of rows containing NAs by the total number of rows:

df <- data.frame(data = c(NA, NA, NA, NA, 2, 4, NA, 7, NA))

percent_NA <- NROW(df[is.na(df$data),])/NROW(df)

Which gives:

> percent_NA
[1] 0.6666667

This is a proportion, so it means 66.67% of the values in my data frame are NA.
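
The same proportion can also be obtained with `mean(is.na(...))`, because `is.na()` returns a logical vector and the mean of a logical vector is the share of `TRUE` values. A minimal sketch, reusing the `df` from above (the name `prop_NA` is only chosen for illustration); multiplying by 100 turns the proportion into a percentage:

```r
df <- data.frame(data = c(NA, NA, NA, NA, 2, 4, NA, 7, NA))

# is.na() gives TRUE/FALSE; mean() treats TRUE as 1 and FALSE as 0,
# so the result is the proportion of NA values (between 0 and 1)
prop_NA <- mean(is.na(df$data))

# multiply by 100 to express the proportion as a percentage
round(100 * prop_NA, 2)
# [1] 66.67
```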

f.lechleitner

I don't understand the issue you are trying to solve. It all works as expected.
First, a dataset since you haven't provided one.

set.seed(6180)  # make it reproducible
dat <- data.frame(x = sample(c(1:4, NA), 100, TRUE),
                  y = sample(c(1:5, NA), 100, TRUE))

Now the code for sums.

s <- sapply(dat, function(x) sum(is.na(x)))
s
# x  y 
#18 13
sum(s)
#[1] 31
sum(is.na(dat))
#[1] 31

colSums(is.na(dat))
# x  y 
#18 13

The same goes for means, be it `mean` or `colMeans`.
EDIT.
Here is the code to get the proportion of NA values per column/variable and the overall proportion.

sapply(dat, function(x) mean(is.na(x)))
#   x    y 
#0.18 0.13
colMeans(is.na(dat))   # Same result, faster
#   x    y 
#0.18 0.13
mean(is.na(dat))       # overall mean
#[1] 0.155
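
These results are proportions, so 0.155 reads as 15.5% of all cells being `NA`, not 0.155%. As for the error in the question: `colMeans()` needs a two-dimensional object such as a matrix or data frame, so it fails on a single column, where `mean()` is the right tool. A small sketch of both points, using a hand-written data frame so the output does not depend on the random seed; the names `small` and `na_pct` are hypothetical, and the last step shows one possible way to keep the per-column percentages in a new data frame:

```r
# tiny example data: x has 2 NAs out of 4, y has 1 NA out of 4
small <- data.frame(x = c(1, NA, 3, NA),
                    y = c(NA, 2, 3, 4))

# a single column is a plain vector, so colMeans() refuses it
res <- try(colMeans(is.na(small$x)), silent = TRUE)
inherits(res, "try-error")
# [1] TRUE

# mean() works on the vector and gives the proportion for that column
mean(is.na(small$x))
# [1] 0.5

# per-column percentages stored in a new data frame
na_pct <- data.frame(variable   = names(small),
                     percent_NA = 100 * colMeans(is.na(small)),
                     row.names  = NULL)
na_pct
#   variable percent_NA
# 1        x         50
# 2        y         25
```
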
Rui Barradas
  • I want the percentage of NA values for the whole data frame and for each variable. My data frame has dimensions 44750 x 7. For each of the 7 variables I did `percentvar <- nrow(df[is.na(df$variable), ])/NROW(df)`. For the whole data frame I did `sum(is.na(df))/prod(dim(df))`. Does that look correct to you? – jessica scucchia Oct 05 '17 at 12:24
  • @jessicascucchia OK, I will edit my answer. There are simple ways of doing what you want. Note that in the code above you really don't need `sapply`; `colSums` and `colMeans` do it for you and are more efficient. – Rui Barradas Oct 05 '17 at 12:40
  • @jessicascucchia And yes, `sum(is.na(df))/prod(dim(df))` does give the same result as my last line of code. But mine is simpler. **Note:** don't name your data frame `df`, since `df` is already an R function (the F-distribution density in the `stats` package). – Rui Barradas Oct 05 '17 at 12:46