I have a tidy data set that describes attributes of products. Each product has many attributes, and each attribute is described in its own row. My goal is to do some calculations on each product without using loops. I want to avoid loops because there are several hundred thousand products, and thus many millions of attributes.

Toy dataset with only one product:

df <- data.frame(
  productID = 1,
  attributeID = seq(1, 15, 1),
  dataType = c('range', 'range', 'predefined', 'predefined', 'bool', 'bool', 'bool', 'bool', 'double', 'double', 'double', 'double', 'double', 'double', 'double'),
  double = c(NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 15, 11.4, 6, 0, 0),
  logical = c(NA, NA, NA, NA, TRUE, FALSE, FALSE, FALSE, NA, NA, NA, NA, NA, NA, NA),
  predefined = c(NA, NA, 'Black', 'Round', NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  from.value = c(0, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  to.value = c(249, 368, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
)

#    productID attributeID   dataType double logical predefined from.value to.value
# 1          1           1      range     NA      NA       <NA>          0      249
# 2          1           2      range     NA      NA       <NA>          0      368
# 3          1           3 predefined     NA      NA      Black         NA       NA
# 4          1           4 predefined     NA      NA      Round         NA       NA
# 5          1           5       bool     NA    TRUE       <NA>         NA       NA
# 6          1           6       bool     NA   FALSE       <NA>         NA       NA
# 7          1           7       bool     NA   FALSE       <NA>         NA       NA
# 8          1           8       bool     NA   FALSE       <NA>         NA       NA
# 9          1           9     double    0.0      NA       <NA>         NA       NA
# 10         1          10     double    0.0      NA       <NA>         NA       NA
# 11         1          11     double   15.0      NA       <NA>         NA       NA
# 12         1          12     double   11.4      NA       <NA>         NA       NA
# 13         1          13     double    6.0      NA       <NA>         NA       NA
# 14         1          14     double    0.0      NA       <NA>         NA       NA
# 15         1          15     double    0.0      NA       <NA>         NA       NA

For example, how would one go about counting the zeros for each product in the double column?

pkg

3 Answers

Since you're only after counting the number of zeros in the double column, the following should help:

library(tidyverse)

# Count, per product, the rows whose double value is exactly zero
df %>%
  group_by(productID) %>%
  summarise(sum.of.zeros = sum(double == 0, na.rm = TRUE))

The code above counts the rows where double equals zero: each comparison yields TRUE (counted as 1) when the value is zero and FALSE (counted as 0) otherwise, and sum() adds these up. The na.rm argument is required because the expression NA==0 returns NA, which would otherwise make the whole sum NA.
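
To see why na.rm matters, here is a small base-R sketch. The vector x is made up for illustration and is not part of the question's data.

```r
# Hypothetical vector mixing zeros, non-zeros and missing values
x <- c(0, 0, 15, NA, 11.4, 0, NA)

# Comparing with 0 gives a logical vector; NA stays NA
is_zero <- x == 0

# Without na.rm the NAs propagate, so this sum is NA
sum(is_zero)

# With na.rm = TRUE the NAs are dropped and each TRUE counts as 1
zero_count <- sum(is_zero, na.rm = TRUE)
zero_count  # 3
```

The same logic applies inside summarise(), where the comparison is evaluated once per group.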

Vlad C.

Take a look at the tidyverse packages, and dplyr in particular:

library(tidyverse)

df %>%
  group_by(productID, from.value) %>%
  summarise(amount = n_distinct(attributeID))

# # A tibble: 2 x 3
# # Groups:   productID [?]
#   productID from.value amount
#       <dbl>      <dbl>  <int>
# 1         1          0      2
# 2         1         NA     13
Wimpel
  • This can get a little messy if you replace "from.value" with "double" in your answer, because then all unique values are listed (including all those that are not zero). Isn't there an easier way? – pkg Sep 10 '18 at 11:48
  • @pkg: not sure what you mean? dplyr was made for this kind of job. You can always `round` the `double` column if you wish to use that, or create a new grouping column based on the values in `double`. – Wimpel Sep 10 '18 at 11:50

With data.table you can do:

library("data.table")
setDT(df)[, sum(na.omit(double) == 0), by = productID]

or

setDT(df)[, sum(double == 0, na.rm = TRUE), by = productID]
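
As a minimal sketch of the same idea with a named result column, assuming data.table is installed: the two-product table dt and the column name zeros below are made up for illustration and do not come from the question.

```r
library(data.table)

# Hypothetical toy data: two products, some rows without a double value
dt <- data.table(
  productID = c(1, 1, 1, 2, 2, 2),
  double    = c(0, NA, 6, 0, 0, NA)
)

# One row per product; zeros counts the rows where double is exactly 0
zeros_by_product <- dt[, .(zeros = sum(double == 0, na.rm = TRUE)), by = productID]
zeros_by_product
```

Wrapping the expression in .( ) lets you name the output column instead of getting the default V1.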
jogo