
I have a big dataset that contains a lot of NAs and some non-NA values. At the moment I count the non-NA values in each column like this:

 attach(df)
 1000 - (sum(is.na(X1)))
 1000 - (sum(is.na(X2)))
 1000 - (sum(is.na(X3)))
 1000 - (sum(is.na(X4)))
 1000 - (sum(is.na(X5)))
 ...
 detach(df)

That is, the total number of observations (1000) minus the number of NA values in each column.

Is there a faster way that uses fewer lines of code and less typing, and gives me a quick overview of the number of non-NA values in every column?

Like a for loop or something?

I am looking for something like this:

  X1     number of non-NA values
  X2     ...
  X3     ...
  X4     ...
  X5     ...
  X6     ...

Thank you :)

zx8754
Peter
    Don't use `attach`; it causes more problems than it solves. – Gregor Thomas May 15 '18 at 19:28
    It's easy to create bugs. If you do any subsetting, grouping, or rearranging of your data your attached columns can get out of sync with each other and out of sync with the actual data. And there are better alternatives: many functions have a `data` argument so you don't need it at all, and you can use `with()` for anything else. See [Why is it advisable to not use attach() in R?](https://stackoverflow.com/q/10067680/903061) and [When to use with() and why is it good?](https://stackoverflow.com/q/10067680/903061) – Gregor Thomas May 16 '18 at 14:21
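As a concrete illustration of the `with()` alternative mentioned in the comment (a minimal sketch, assuming a data frame `df` with a column `X1`):

```r
# with() evaluates the expression using df's columns directly,
# without modifying the search path the way attach() does
df <- data.frame(X1 = c(1, NA, 3, NA))
with(df, sum(!is.na(X1)))
#> [1] 2
```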

3 Answers


You can also call `is.na` on the entire data frame (implicitly coercing it to a logical matrix) and call `colSums` on the negated result:

# make sample data
set.seed(47)
df <- as.data.frame(matrix(sample(c(0:1, NA), 100*5, TRUE), 100))

str(df)
#> 'data.frame':    100 obs. of  5 variables:
#>  $ V1: int  NA 1 NA NA 1 NA 1 1 1 NA ...
#>  $ V2: int  NA NA NA 1 NA 1 0 1 0 NA ...
#>  $ V3: int  1 1 0 1 1 NA NA 1 NA NA ...
#>  $ V4: int  NA 0 NA 0 0 NA 1 1 NA NA ...
#>  $ V5: int  NA NA NA 0 0 0 0 0 NA NA ...

colSums(!is.na(df))
#> V1 V2 V3 V4 V5 
#> 69 55 62 60 70
alistaire

With dplyr, that would be:

library(dplyr)

df %>%
  summarise_all(funs(sum(!is.na(.))))
#  V1 V2 V3 V4 V5
#1 65 66 70 62 74

The advantage of that approach is that you can use group_by before, and that you don't need to care about column names (it just summarizes all of them).
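(Note for later dplyr versions: `funs()` was deprecated in dplyr 0.8, and `summarise_all()` has since been superseded by `across()`. An equivalent, assuming dplyr >= 1.0, would be:)

```r
library(dplyr)

# Count non-NA values in every column using across()
df %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))
```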

arg0naut91

Try this:

# Count the non-NA values in each column; returns a named vector
nonNA_counts <- sapply(df, function(x) sum(!is.na(x)))
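`sapply` returns a named vector; to get the two-column layout sketched in the question, it can be wrapped in a data frame (the names `column` and `non_na` below are illustrative, and the sample data is made up):

```r
# Example data with some NAs
df <- data.frame(X1 = c(1, NA, 3), X2 = c(NA, NA, 6))

nonNA_counts <- sapply(df, function(x) sum(!is.na(x)))

# One row per column: column name and its non-NA count
data.frame(column = names(nonNA_counts), non_na = unname(nonNA_counts))
#>   column non_na
#> 1     X1      2
#> 2     X2      1
```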
Melissa Key