Calculating on grouped rows without loops

Question

I have a tidy data set that describes attributes of products. Each product has many attributes, and each attribute is described in its own row. My goal is to do some calculations on each product without using loops. I want to avoid loops because there are several hundred thousand products, and thus many millions of attributes.

Toy dataset with only one product:

df <- data.frame(
  productID   = 1,
  attributeID = seq(1, 15, 1),
  dataType    = c('range', 'range', 'predefined', 'predefined', 'bool', 'bool', 'bool', 'bool',
                  'double', 'double', 'double', 'double', 'double', 'double', 'double'),
  double      = c(NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 15, 11.4, 6, 0, 0),
  logical     = c(NA, NA, NA, NA, TRUE, FALSE, FALSE, FALSE, NA, NA, NA, NA, NA, NA, NA),
  predefined  = c(NA, NA, 'Black', 'Round', NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  from.value  = c(0, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  to.value    = c(249, 368, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
)

#    productID attributeID   dataType double logical predefined from.value to.value
# 1          1           1      range     NA      NA       <NA>          0      249
# 2          1           2      range     NA      NA       <NA>          0      368
# 3          1           3 predefined     NA      NA      Black         NA       NA
# 4          1           4 predefined     NA      NA      Round         NA       NA
# 5          1           5       bool     NA    TRUE       <NA>         NA       NA
# 6          1           6       bool     NA   FALSE       <NA>         NA       NA
# 7          1           7       bool     NA   FALSE       <NA>         NA       NA
# 8          1           8       bool     NA   FALSE       <NA>         NA       NA
# 9          1           9     double    0.0      NA       <NA>         NA       NA
# 10         1          10     double    0.0      NA       <NA>         NA       NA
# 11         1          11     double   15.0      NA       <NA>         NA       NA
# 12         1          12     double   11.4      NA       <NA>         NA       NA
# 13         1          13     double    6.0      NA       <NA>         NA       NA
# 14         1          14     double    0.0      NA       <NA>         NA       NA
# 15         1          15     double    0.0      NA       <NA>         NA       NA

For example, how would one go about counting the zeros for each product in the double column?

Vlad C. · Accepted Answer · 2018-09-10T12:35:36.677

2

Since you're only after counting the number of zeros in the double column, the following should help:

library(tidyverse)
df %>% 
  group_by(productID) %>% 
  summarise(sum.of.zeros=sum(double==0, na.rm = T))

The above sums the instances where double equals zero: each comparison that is TRUE counts as 1 and each that is FALSE counts as 0. The na.rm = T is required because the expression NA==0 returns NA, and without it the whole sum would be NA as well.
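To see why summing a comparison works, here is a small base R sketch using the values of the double column from the toy data set. The variable names x, is_zero, sum_with_na and zero_count are made up for illustration only.

```r
# Values of the `double` column from the toy data set
x <- c(NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 15, 11.4, 6, 0, 0)

# Comparing against zero gives TRUE/FALSE, and NA wherever x is NA
is_zero <- x == 0

# Without na.rm the NAs propagate, so the whole sum is NA
sum_with_na <- sum(is_zero)

# With na.rm = TRUE the NAs are dropped and each TRUE counts as 1
zero_count <- sum(is_zero, na.rm = TRUE)

print(sum_with_na)  # NA
print(zero_count)   # 4
```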

edited Sep 10 '18 at 12:35

answered Sep 10 '18 at 12:07

You could also do `sum(double == 0, na.rm = T)` – jdobres Sep 10 '18 at 12:19
Good point. Corrected. – Vlad C. Sep 10 '18 at 12:36

score 1 · Answer 2 · answered Sep 10 '18 at 11:36

Take a look at the tidyverse packages, and dplyr in particular:

library(tidyverse)

df %>%
  group_by(productID, from.value) %>%
  summarise(amount = n_distinct(attributeID))

# # A tibble: 2 x 3
# # Groups:   productID [?]
#   productID from.value amount
#       <dbl>      <dbl>  <int>
# 1         1          0      2
# 2         1         NA     13

Wimpel

This can get a little messy if you replace "from.value" with "double" in your answer, because then all unique values are listed (including those that are not zero). Isn't there an easier way? – pkg Sep 10 '18 at 11:48
@pkg: not sure what you mean? dplyr was made for this kind of job... You can always `round` the `double` column if you wish to use that, or create a new grouping column based on the values in `double`. – Wimpel Sep 10 '18 at 11:50
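
The idea of a grouping column based on the values in `double` could look roughly like the sketch below. It is only one possible reading of that suggestion: the column name is.zero is hypothetical, and the data frame is a cut-down stand-in for the toy data set.

```r
library(dplyr)

# Cut-down stand-in for the toy data: one product, `double` column only
df <- data.frame(
  productID = 1,
  double    = c(NA, NA, 0, 0, 15, 11.4, 6, 0, 0)
)

# Flag zeros first (is.zero is a hypothetical helper column), then count
# rows per product and flag instead of per distinct value of `double`
zero.counts <- df %>%
  mutate(is.zero = double == 0) %>%
  count(productID, is.zero)

print(zero.counts)
```

This gives at most three rows per product (FALSE, TRUE and NA) rather than one row per distinct value.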

jogo · Answer 3 · 2018-09-10T12:43:41.503

1

With data.table you can do:

library("data.table")
setDT(df)[, sum(na.omit(double)==0), productID]

or

setDT(df)[, sum(double==0, na.rm=TRUE), productID]
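
In the forms above the count comes back in a column that data.table names V1 by default. A minimal sketch with a named result column follows, assuming a made-up table with two products so that the grouping is visible. The names dt, zeros and sum.of.zeros are illustrative and not part of the original data.

```r
library(data.table)

# Two made-up products to show the per-group result
dt <- data.table(
  productID = c(1, 1, 1, 2, 2),
  double    = c(0, NA, 15, 0, 0)
)

# .() builds a named result column; by = groups the rows per product
zeros <- dt[, .(sum.of.zeros = sum(double == 0, na.rm = TRUE)), by = productID]

print(zeros)
```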

edited Sep 10 '18 at 12:43

answered Sep 10 '18 at 12:11
