11

I have a data.table like:

library(data.table)
widgets <- data.table(serial_no=1:100, 
                      color=rep_len(c("red","green","blue","black"),length.out=100),
                      style=rep_len(c("round","pointy","flat"),length.out=100),
                      weight=rep_len(1:5,length.out=100) )

Although I am not sure this is the most data.table way, I can calculate subgroup frequency by group using table and length in a single step-- for example, to answer the question "What percent of red widgets are round?"

edit: this code does not provide the right answer

# example A
widgets[, list(style = unique(style), 
               style_pct_of_color_by_count = 
                 as.numeric(table(style)/length(style)) ), by=color]

#    color  style style_pct_of_color_by_count
# 1:   red  round                        0.32
# 2:   red pointy                        0.32
# 3:   red   flat                        0.36
# 4: green pointy                        0.32
# ...

But I can't use that approach to answer questions like "By weight, what percent of red widgets are round?" I can only come up with a two-step approach:

# example B
widgets[,list(cs_weight=sum(weight)),by=list(color,style)][,list(style, style_pct_of_color_by_weight=cs_weight/sum(cs_weight)),by=color]

#    color  style style_pct_of_color_by_weight
# 1:   red  round                    0.3466667
# 2:   red pointy                    0.3466667
# 3:   red   flat                    0.3066667
# 4: green pointy                    0.3333333
# ...

I'm looking for a single-step approach to B, and A if improvable, in an explanation that deepens my understanding of data.table syntax for by-group operations. Please note that this question is different from Weighted sum of variables by groups with data.table because mine involves subgroups and avoiding multiple steps. TYVM.

Community
  • 1
  • 1
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • Looking at response from @Frank below I noticed that my attempt A is not only awkward but incorrect -- for example, I checked `widgets[,sum(style=="round" & color=="red")/sum(color=="red")] # 0.36` – C8H10N4O2 Jun 19 '15 at 18:07

3 Answers3

16

This is almost a single step:

# A
widgets[,{
    totwt = .N
    .SD[,.(frac=.N/totwt),by=style]
},by=color]
    # color  style frac
 # 1:   red  round 0.36
 # 2:   red pointy 0.32
 # 3:   red   flat 0.32
 # 4: green pointy 0.36
 # 5: green   flat 0.32
 # 6: green  round 0.32
 # 7:  blue   flat 0.36
 # 8:  blue  round 0.32
 # 9:  blue pointy 0.32
# 10: black  round 0.36
# 11: black pointy 0.32
# 12: black   flat 0.32

# B
widgets[,{
    totwt = sum(weight)
    .SD[,.(frac=sum(weight)/totwt),by=style]
},by=color]
 #    color  style      frac
 # 1:   red  round 0.3466667
 # 2:   red pointy 0.3466667
 # 3:   red   flat 0.3066667
 # 4: green pointy 0.3333333
 # 5: green   flat 0.3200000
 # 6: green  round 0.3466667
 # 7:  blue   flat 0.3866667
 # 8:  blue  round 0.2933333
 # 9:  blue pointy 0.3200000
# 10: black  round 0.3733333
# 11: black pointy 0.3333333
# 12: black   flat 0.2933333

How it works: Construct your denominator for the top-level group (color) before going to the finer group (color with style) to tabulate.


Alternatives. If styles repeat within each color and this is only for display purposes, try a table:

# A
widgets[,
  prop.table(table(color,style),1)
]
#        style
# color   flat pointy round
#   black 0.32   0.32  0.36
#   blue  0.36   0.32  0.32
#   green 0.32   0.36  0.32
#   red   0.32   0.32  0.36

# B
widgets[,rep(1L,sum(weight)),by=.(color,style)][,
  prop.table(table(color,style),1)
]

#        style
# color        flat    pointy     round
#   black 0.2933333 0.3333333 0.3733333
#   blue  0.3866667 0.3200000 0.2933333
#   green 0.3200000 0.3333333 0.3466667
#   red   0.3066667 0.3466667 0.3466667

For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea (since it costs so much memory). Also, weight has to be an integer; otherwise, its sum will be silently truncated to one (e.g., try rep(1,2.5) # [1] 1 1).

Frank
  • 66,179
  • 8
  • 96
  • 180
  • This is what I do, but I'd be interested in finding a better way, too. – Frank Jun 19 '15 at 17:56
  • 1
    Thanks @Frank -- it is going to take me a while to grok the dot notation and the embedded assignments, but this is a great approach. – C8H10N4O2 Jun 19 '15 at 22:35
  • 1
    Your first version can be rewritten without temp variable as follows: `widgets[, .(frac = .SD[, .N, by=style]$N / .N), by=color]` – Arun Jun 21 '15 at 18:35
  • @Arun The `style` column should be in the result, too. – Frank Jun 21 '15 at 18:43
2

it may be a good idea to use dplyr

df <- widgets %>% 
  group_by(color, style) %>%
  summarise(count = n()) %>%
  mutate(freq = count/sum(count))

df2 <- widgets %>% 
  group_by(color, style) %>%
  summarise(count_w = sum(weight)) %>%
  mutate(freq = count_w/sum(count_w))  
drsh1
  • 54
  • 7
  • 2
    Thanks @drsh1 I appreciate that `dplyr` is intuitive and useful here. My question specifically though was how to use `data.table` syntax. – C8H10N4O2 Jun 19 '15 at 22:32
2

Calculate a frequency table for each style within color and then for each row look up the frequency for that row's style in that table finally dividing by the number of rows within that color.

widgets[, frac := table(style)[style] / .N, by = color]

giving:

  > widgets
     serial_no color  style weight frac
  1:         1   red  round      1 0.36
  2:         2 green pointy      2 0.36
  3:         3  blue   flat      3 0.36
  4:         4 black  round      4 0.36
  5:         5   red pointy      5 0.32
  6:         6 green   flat      1 0.32
  7:         7  blue  round      2 0.32
  8:         8 black pointy      3 0.32
  9:         9   red   flat      4 0.32
 10:        10 green  round      5 0.32
 ... etc ...

This could readily be translated to base or dplyr, if desired:

# base
prop <- function(x) table(x)[x] / length(x)
transform(widgets, frac = ave(style, color, FUN = prop))

# dplyr - uses prop function from above
library(dplyr)
widgets %>% group_by(color) %>% mutate(frac = prop(style)) %>% ungroup
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341