
I have a big dataset that contains a lot of NAs and some non-NA values. At the moment I count the non-NA values in each column like this:

 attach(df)
 1000 - (sum(is.na(X1)))
 1000 - (sum(is.na(X2)))
 1000 - (sum(is.na(X3)))
 1000 - (sum(is.na(X4)))
 1000 - (sum(is.na(X5)))
 ...
 detach(df)

That is, the total number of observations (1000) minus the number of NA values in each column.

Is there a faster way that uses fewer lines of code and less typing, and gives me a quick overview of the number of non-NA values in every column?

Like a for loop or something?

I am looking for something like this:

  X1     number of non-NA values
  X2     ...
  X3     ...
  X4     ...
  X5     ...
  X6     ...

Thank you :)

zx8754
Peter
    Don't use `attach`; it causes more problems than it solves. – Gregor Thomas May 15 '18 at 19:28
    It's easy to create bugs. If you do any subsetting, grouping, or rearranging of your data your attached columns can get out of sync with each other and out of sync with the actual data. And there are better alternatives: many functions have a `data` argument so you don't need it at all, and you can use `with()` for anything else. See [Why is it advisable to not use attach() in R?](https://stackoverflow.com/q/10067680/903061) and [When to use with() and why is it good?](https://stackoverflow.com/q/10067680/903061) – Gregor Thomas May 16 '18 at 14:21
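As a concrete illustration of the `with()` alternative mentioned in the comment (a minimal sketch, assuming a data frame `df` with a column `X1`):

```r
# with() evaluates the expression using df's columns directly,
# without modifying the search path the way attach() does
df <- data.frame(X1 = c(1, NA, 3, NA))
with(df, sum(!is.na(X1)))
#> [1] 2
```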

3 Answers


You can also call `is.na` on the entire data frame (implicitly coercing it to a logical matrix) and call `colSums` on the negated result:

# make sample data
set.seed(47)
df <- as.data.frame(matrix(sample(c(0:1, NA), 100*5, TRUE), 100))

str(df)
#> 'data.frame':    100 obs. of  5 variables:
#>  $ V1: int  NA 1 NA NA 1 NA 1 1 1 NA ...
#>  $ V2: int  NA NA NA 1 NA 1 0 1 0 NA ...
#>  $ V3: int  1 1 0 1 1 NA NA 1 NA NA ...
#>  $ V4: int  NA 0 NA 0 0 NA 1 1 NA NA ...
#>  $ V5: int  NA NA NA 0 0 0 0 0 NA NA ...

colSums(!is.na(df))
#> V1 V2 V3 V4 V5 
#> 69 55 62 60 70
alistaire

With dplyr, that would be:

library(dplyr)

df %>%
  summarise_all(funs(sum(!is.na(.))))
#  V1 V2 V3 V4 V5
#1 65 66 70 62 74

The advantage of that approach is that you can use group_by before, and that you don't need to care about column names (it just summarizes all of them).
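(Note for later dplyr versions: `funs()` was deprecated in dplyr 0.8, and `summarise_all()` has since been superseded by `across()`. An equivalent, assuming dplyr >= 1.0, would be:)

```r
library(dplyr)

# Count non-NA values in every column using across()
df %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))
```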

arg0naut91

Try this:

# Count the non-NA values in each column; returns a named vector
nonNA_counts <- sapply(df, function(x) sum(!is.na(x)))
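`sapply` returns a named vector; to get the two-column layout sketched in the question, it can be wrapped in a data frame (the names `column` and `non_na` below are illustrative, and the sample data is made up):

```r
# Example data with some NAs
df <- data.frame(X1 = c(1, NA, 3), X2 = c(NA, NA, 6))

nonNA_counts <- sapply(df, function(x) sum(!is.na(x)))

# One row per column: column name and its non-NA count
data.frame(column = names(nonNA_counts), non_na = unname(nonNA_counts))
#>   column non_na
#> 1     X1      2
#> 2     X2      1
```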
Melissa Key