How to count total non-na values using dplyr?

Question

Data

df <- structure(list(error_code = c("2892141", "2892141", "2892141", 
"2892141", "2892141", "2892141", "2892141", "2892141", "2892141", 
"2892141", "2892141", "2892141", "2892141", "2892141", "2892141", 
"2892141", "2907630", "2907630", "2907630", "2907630", "2907630", 
"2907630", "2907630", "2907630", "2907630", "2907630", "2907630", 
"2907630", "2907630", "2907630", "2907630", "2907630", "2907631", 
"2907631", "2907631", "2907631", "2907631", "2907631", "2907631", 
"2907631", "2907631", "2907631", "2907631", "2907631", "2907631", 
"2907631", "2907631", "2907631", "2907632", "2907632", "2907632", 
"2907632", "2907632", "2907632", "2907632", "2907632", "2907632", 
"2907632", "2907632", "2907632", "2907632", "2907632", "2907632", 
"2907632", "2907633", "2907633", "2907633", "2907633", "2907633", 
"2907633", "2907633", "2907633", "2907633", "2907633", "2907633", 
"2907633", "2907633", "2907633", "2907633", "2907633", "2907634", 
"2907634", "2907634", "2907634", "2907634", "2907634", "2907634", 
"2907634", "2907634", "2907634", "2907634", "2907634", "2907634", 
"2907634", "2907634", "2907634", "2907635", "2907635", "2907635", 
"2907635", "2907635", "2907635", "2907635", "2907635", "2907635", 
"2907635", "2907635", "2907635", "2907635", "2907635", "2907635", 
"2907635", "2907636", "2907636", "2907636", "2907636", "2907636", 
"2907636", "2907636", "2907636", "2907636", "2907636", "2907636", 
"2907636", "2907636", "2907636", "2907636", "2907636", "2907637", 
"2907637", "2907637", "2907637", "2907637", "2907637", "2907637", 
"2907637", "2907637", "2907637", "2907637", "2907637", "2907637", 
"2907637", "2907637", "2907637", "2907638", "2907638", "2907638", 
"2907638", "2907638", "2907638", "2907638", "2907638", "2907638", 
"2907638", "2907638", "2907638", "2907638", "2907638", "2907638", 
"2907638", "2907639", "2907639", "2907639", "2907639", "2907639", 
"2907639", "2907639", "2907639", "2907639", "2907639", "2907639", 
"2907639", "2907639", "2907639", "2907639", "2907639", "2907640", 
"2907640", "2907640", "2907640", "2907640", "2907640", "2907640", 
"2907640", "2907640", "2907640", "2907640", "2907640", "2907640", 
"2907640", "2907640", "2907640", "2907641", "2907641", "2907641", 
"2907641", "2907641", "2907641", "2907641", "2907641", "2907641", 
"2907641", "2907641", "2907641", "2907641", "2907641", "2907641", 
"2907641", "2907642", "2907642", "2907642", "2907642", "2907642", 
"2907642", "2907642", "2907642", "2907642", "2907642", "2907642", 
"2907642", "2907642", "2907642", "2907642", "2907642", "2907644", 
"2907644", "2907644", "2907644", "2907644", "2907644", "2907644", 
"2907644", "2907644", "2907644", "2907644", "2907644", "2907644", 
"2907644", "2907644", "2907644", "2907645", "2907645", "2907645", 
"2907645", "2907645", "2907645", "2907645", "2907645", "2907645", 
"2907645", "2907645", "2907645", "2907645", "2907645", "2907645", 
"2907645"), value = c(" ", NA, NA, NA, "2", NA, NA, NA, NA, "1", 
"1", "2", "2", NA, "2", "2", NA, NA, " ", NA, NA, NA, NA, "1", 
NA, "2", "2", "2", NA, NA, NA, NA, "2", NA, " ", NA, NA, NA, 
"1", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "2", " ", NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "2", NA, NA, NA, 
NA, NA, NA, NA, "1", "2", NA, NA, " ", NA, NA, NA, "2", NA, NA, 
NA, "2", NA, "1", NA, "2", NA, NA, NA, "2", NA, " ", NA, "2", 
" ", NA, NA, NA, "2", NA, NA, NA, "2", NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, "2", NA, "1", "1", "2", NA, NA, " ", 
"2", "2", " ", "2", "2", NA, "1", "2", "2", NA, "1", NA, NA, 
"1", "2", NA, NA, "1", NA, NA, "1", "1", NA, "1", "2", NA, "1", 
"2", NA, "2", "2", " ", "1", "2", " ", NA, NA, NA, NA, NA, NA, 
NA, "2", NA, NA, "2", NA, NA, NA, "2", NA, NA, "2", NA, NA, "1", 
"1", NA, NA, "2", NA, NA, NA, " ", "2", "2", NA, NA, "2", "2", 
NA, NA, NA, "2", NA, NA, NA, NA, NA, NA, " ", " ", NA, NA, "2", 
"2", NA, NA, NA, NA, "2", NA, NA, NA, NA, NA, NA, "1", NA, "2", 
NA, "1", NA, "1", "2", "1", "2", NA, NA, NA, NA, NA, " ", NA, 
NA, "2", NA, NA, NA, "1", NA, "1", NA, "1", " ", NA, NA, "1", 
"1")), row.names = c(NA, -256L), class = "data.frame")

I want to group each error_codes and count how many of them have non-na value. This is just a part of the data in which the column name is value, but there are other column names with similar values but are called something else.

So, I have the column name stored in a variable whichever is needed.

categoryColumn <- getCategoryColumnName(categoryId)

In this case, categoryColumn would be value. Then, to group and extract the total non-na rows, this is what I did

df%>% group_by(error_code) %>% select(categoryColumn) %>% summarise(ties = sum(is.na(.)))

The code above gives me this:

  error_code         value
 1 2892141           162
 2 2907630           162
 3 2907631           162
 4 2907632           162
 5 2907633           162
 6 2907634           162
 7 2907635           162
 8 2907636           162
 9 2907637           162
10 2907638           162
11 2907639           162
12 2907640           162
13 2907641           162
14 2907642           162
15 2907644           162
16 2907645           162

That's not accurate. How can I get the value that pertains to non-na values?

akrun · Accepted Answer · 2020-09-16T23:02:31.953

4

If it is the count of non-NA, create a logical vector (!is.na(value)) and get the sum

library(dplyr)
df %>%
      group_by(error_code) %>%
      summarise(value = sum(!is.na(value)))

If the column name/names is stored in an object, use across as it is generalized and can work for multiple columns

df %>%
   group_by(error_code) %>%
   summarise(across(categoryColumn, ~ sum(!is.na(.))))

sum(is.na(.)) outside the across will check for the total sum of all the elements as the . here implies the whole dataset and not the column per se

edited Sep 16 '20 at 23:02

answered Sep 16 '20 at 22:57

akrun

874,273
37
540
662

1

That's great! I accidentally missed the `sum(!is.na())` in my code while trying multiple thing, but the result was still the same. But your second code with `across` worked great. That's exactly what I was looking for. Thanks!! – user1828605 Sep 16 '20 at 23:02

score 1 · Answer 2 · answered Sep 16 '20 at 23:02

A base R solution would be:

#Code
aggregate(value~error_code,df,FUN = function(x) length(which(!is.na(x))))

Output:

   error_code value
1     2892141     8
2     2907630     5
3     2907631     3
4     2907632     2
5     2907633     4
6     2907634     6
7     2907635     4
8     2907636     6
9     2907637    10
10    2907638    11
11    2907639     4
12    2907640     7
13    2907641     5
14    2907642     4
15    2907644     8
16    2907645     7

How to count total non-na values using dplyr?

2 Answers2