
After some time away from R, I feel I am writing very clumsy code to get basic summary statistics in data.table.

What I am doing is finding the proportion of individuals in good/bad health, conditional on species.

library(data.table)

# Some data
n <- 300
set.seed(2)
dt <- data.table(type = sample(x = c("Dog", "Cat", "Horse"), size = n, replace = TRUE),
                 health = sample(x = c("Good", "Bad"), size = n, replace = TRUE))

# Making the table. In a clumsy manner?
dt.fr <- dt[, .N, .(type, health)][, perc.type := N/sum(N)*100, 
                                   by = type][order(type, health)]
dt.fr

    type health  N perc.type
1:   Cat    Bad 38  44.70588
2:   Cat   Good 47  55.29412
3:   Dog    Bad 56  50.90909
4:   Dog   Good 54  49.09091
5: Horse    Bad 61  58.09524
6: Horse   Good 44  41.90476

How would I produce the table above with more elegant code?

s_baldur
  • I guess it's a subjective question. I think your way is fine; kind of has to be done in two steps since you're aggregating on two levels. You could nest the steps instead of chaining them, but I think that's harder to read: `dt[, {NN = .N; .SD[, .(N = .N, perc.type = 100*.N/NN), keyby=health]}, keyby=type]` – Frank Aug 26 '16 at 00:37
  • `dt[, perc.type := prop.table(health), by = type]`. Then use `setorder` to order the column values by reference. Note, I did not try this as I am away from my desk. – Sathish Aug 26 '16 at 00:45
  • Interesting @Sathish but gives error `Error in sum(x) : invalid 'type' (character) of argument` – s_baldur Aug 26 '16 at 00:49
  • `within(data.frame(table(dt)), P <- ave(Freq, type, FUN = prop.table) * 100)` – rawr Aug 26 '16 at 00:53
  • Hope this answer of mine may help you http://stackoverflow.com/questions/38778447/proportional-tables-by-group/38779415#38779415 – Sathish Aug 26 '16 at 00:54
  • I'd just use `prop.table` so it's obvious what you're doing: `dt[, .N, by = .(type, health)][, perc.type := prop.table(N), by = type][]` – alistaire Aug 26 '16 at 00:59
  • Thanks @alistaire. But I guess it depends on exposure to R and math which one you find more obvious. – s_baldur Aug 26 '16 at 01:04
  • R, yes; math...maybe. Really, I just start to zone out when reading somebody's hard-coded stats (especially when it's longer), so I stick to existing functions when there is one. TBH, I'd use the dplyr `dt %>% count(type, health) %>% mutate(perc.type = prop.table(n))` anyway, but that's a different war. – alistaire Aug 26 '16 at 01:10
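
For anyone comparing the suggestions above, here is a minimal sketch that puts the question's chained version next to the `prop.table` variant from the comments and checks that they agree. One caveat: `prop.table` returns proportions rather than percentages, so the factor of 100 below is an addition and is not in the comment as written. The names `a` and `b` are just placeholders for this illustration, and `keyby` is used in the second version as one possible way to get the sorted output without a separate `order()` step.

```r
library(data.table)

# Same simulated data as in the question
n <- 300
set.seed(2)
dt <- data.table(type   = sample(c("Dog", "Cat", "Horse"), size = n, replace = TRUE),
                 health = sample(c("Good", "Bad"), size = n, replace = TRUE))

# Question's approach: count per (type, health), then take the share within type
a <- dt[, .N, by = .(type, health)][, perc.type := N / sum(N) * 100,
                                    by = type][order(type, health)]

# prop.table variant: keyby sorts the counts by (type, health),
# and the proportions are scaled by 100 to match the percentages above
b <- dt[, .N, keyby = .(type, health)][, perc.type := 100 * prop.table(N),
                                       by = type][]

# Both versions should give the same counts and percentages, row for row
all.equal(a$N, b$N)
all.equal(a$perc.type, b$perc.type)
```

Whether `N / sum(N)` or `prop.table(N)` reads better is, as the thread says, largely a matter of taste; both still need the two-step count-then-normalise structure.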
