Count non-NA values by group

Question

Here is my example:

mydf<-data.frame('col_1' = c('A','A','B','B'), 'col_2' = c(100,NA, 90,30))

I would like to group by col_1 and count the non-NA elements in col_2.

I would like to do it with dplyr. Here is what I tried:

mydf %>% group_by(col_1) %>% summarise_each(funs(!is.na(col_2)))
mydf %>% group_by(col_1) %>% mutate(non_na_count = length(col_2, na.rm=TRUE))
mydf %>% group_by(col_1) %>% mutate(non_na_count = count(col_2, na.rm=TRUE))

Nothing worked. Any suggestions?

score 74 · Accepted Answer · answered May 31 '17 at 17:02


You can use `sum(!is.na())` inside `summarise()`:

mydf %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)))

# A tibble: 2 x 2
   col_1 non_na_count
  <fctr>        <int>
1      A            1
2      B            2

answered May 31 '17 at 17:02

Richard Telford



For getting a summary for all columns, use `summarise_all(funs(sum(!is.na(.))))` – cacti5 Jul 19 '18 at 18:20
If applying another summary function to col_2, be careful about the order in which you request the calculations. `my_df %>% group_by(col_1) %>% summarise(col_1 = mean(col_1, na.rm = T), non_na_count = sum(!is.na(col_2)))` produces a different result than `my_df %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)), col_1 = mean(col_1, na.rm = T))` – zack Apr 02 '20 at 21:09
@zack I get identical results for both orders (I am using dplyr version 0.8.99.9002 from github). – Richard Telford Apr 03 '20 at 12:46
@RichardTelford I made a mistake in typing my comment. Instead of `col_1 = mean(col_1, na.rm = T)` in the call to `summarise`, try `col_2 = mean(col_2, na.rm = T)`. Using dplyr version 0.8.3, I get different results. – zack Apr 05 '20 at 00:47
Why does this work with the sum function, given that we want a count? – Ariel Dec 22 '22 at 15:11

`!is.na()` converts the data to TRUE/FALSE. `sum()` treats TRUE as 1 and FALSE as 0, so the sum is the count of the non-NA values. – Richard Telford Dec 23 '22 at 16:52
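
To illustrate the comment above, here is a minimal base R sketch on a plain vector. The names `x`, `not_missing` and `non_na_count` are made up for this illustration and are not from the original answer.

```r
# Same values as col_2 in the question
x <- c(100, NA, 90, 30)

# !is.na() gives one logical per element: TRUE where a value is present
not_missing <- !is.na(x)
print(not_missing)   # TRUE FALSE TRUE TRUE

# sum() coerces TRUE to 1 and FALSE to 0, so this counts the non-NA values
non_na_count <- sum(not_missing)
print(non_na_count)  # 3

# length() counts every element, NA included, and has no na.rm argument,
# which is why length(col_2, na.rm = TRUE) in the question fails
print(length(x))     # 4
```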

score 8 · Answer 2 · edited May 31 '17 at 20:59

We can filter out the NA elements in 'col_2' and then count the rows by 'col_1':

mydf %>%
  filter(!is.na(col_2)) %>%
  count(col_1)
# A tibble: 2 x 2
#   col_1     n
#  <fctr> <int>
#1      A     1
#2      B     2

Or using data.table

library(data.table)
setDT(mydf)[, .(non_na_count = sum(!is.na(col_2))), col_1]

Or with aggregate from base R

aggregate(cbind(col_2 = !is.na(col_2))~col_1, mydf, sum)
#  col_1 col_2
#1     A     1
#2     B     2

Or using table

table(mydf$col_1[!is.na(mydf$col_2)])

edited May 31 '17 at 20:59

answered May 31 '17 at 17:25

akrun


Why isn't the last answer using table: table(mydf$col_1[ , ! is.na(mydf$col_2)])? – W Barker Jun 29 '19 at 13:12
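
On the comment above: a likely explanation is that `mydf$col_1` is a plain vector, so it takes a single index, while the `[rows, cols]` form only applies to two-dimensional objects such as the data frame itself. The sketch below is only an illustration; `keep`, `counts` and `counts_df` are names made up here.

```r
col_1 <- c("A", "A", "B", "B")
col_2 <- c(100, NA, 90, 30)
mydf <- data.frame(col_1 = col_1, col_2 = col_2)

# Logical mask: TRUE for rows where col_2 is present
keep <- !is.na(col_2)

# mydf$col_1 is one-dimensional, so a single index is enough
counts <- table(mydf$col_1[keep])
print(counts)

# mydf$col_1[, keep] would fail with "incorrect number of dimensions".
# The two-index form works on the data frame, with the mask in the row position:
counts_df <- table(mydf[keep, "col_1"])
print(counts_df)
```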

score 5 · Answer 3 · edited Mar 27 '19 at 23:23


library(knitr)
library(dplyr)

mydf <- data.frame("col_1" = c("A", "A", "B", "B"), 
                   "col_2" = c(100, NA, 90, 30))

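# Note: this counts the NA values per group (sum(is.na(.)));
# use sum(!is.na(.)) instead to count the non-NA values.
mydf %>%
  group_by(col_1) %>%
  select_if(function(x) any(is.na(x))) %>%   # keep only columns that contain NA
  summarise_all(funs(sum(is.na(.)))) -> NA_mydf

kable(NA_mydf)   # print the result as a table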
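
`funs()` has been deprecated since dplyr 0.8.0, so on a recent installation the same idea may be easier to write with `across()`. The following is a sketch that assumes dplyr 1.0.0 or later; it counts the non-NA values (what the question asks for) rather than the NA values, and the name `non_na_counts` is made up for this example.

```r
library(dplyr)

mydf <- data.frame(col_1 = c("A", "A", "B", "B"),
                   col_2 = c(100, NA, 90, 30))

# across(everything(), ...) applies the function to every non-grouping column;
# sum(!is.na(.x)) counts the values that are present in each group
non_na_counts <- mydf %>%
  group_by(col_1) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))

print(non_na_counts)
# col_2 is 1 for group A and 2 for group B
```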

edited Mar 27 '19 at 23:23

Tung


answered Feb 03 '18 at 06:59

Anya Sti

