3

I have the following data.table with R

library(data.table)
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2", ...), category = c("red", "red", "blue", "red", "red", "blue", "green", "green", ...))

dt
ID         category
person1    red
person1    red
person1    blue
person2    red
person2    red
person2    blue
person2    green
person2    green
person3    blue
....

I am looking how to create a "frequency" of the categorical variables red, blue, green for each unique ID, and then expand these columns to record the counts for each. The resulting data.table would look like this:

dt
ID        red    blue    green
person1   2      1       0
person2   2      1       2    
...

I incorrectly thought the correct way to start this with data.table would be to compute the table() by group, e.g.

dt[, counts :=table(category), by=ID]

But this appears to count the total number of categorical values by group ID. This also doesn't solve my problem of "expanding" the data.table.

What is the correct way to do this?

ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234

3 Answers3

2

Like this?

library(data.table)
library(dplyr)
dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category)

If you want to add these columns to the original data.table

counts <- dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category) 
counts[is.na(counts)] <- 0
output <- merge(dt, counts, by = "ID")
amatsuo_net
  • 2,409
  • 11
  • 20
  • This works! One question (as I'm not as familiar with `dpylr`): let's say the original `dt` had several columns: what if i wanted to keep another column? At the moment, `dcast(ID ~ category)` results in a data.table with only ID and the categories (like in my example). – ShanZhengYang Jun 05 '17 at 17:01
  • See my edit. You can merge the table data to the original. – amatsuo_net Jun 05 '17 at 17:10
2

You can use the reshape library with one line.

library(reshape2)
dcast(data=dt,
      ID ~ category,
      fun.aggregate = length,
      value.var = "category")

       ID blue green red
1 person1    1     0   2
2 person2    1     2   2

In addition if you just need a simple 2-way table, you can use the builtin R table function.

table(dt$ID,dt$category)

akaDrHouse
  • 2,190
  • 2
  • 20
  • 29
1

This is done in the imperative style, there's probably a cleaner, functional way to do it.

library(data.table)
library(dtplyr)
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2"), 
                category = c("red", "red", "blue", "red", "red", "blue", "green", "green"))


ids <- unique(dt$ID)
categories <- unique(dt$category)
counts <- matrix(nrow=length(ids), ncol=length(categories))
rownames(counts) <- ids
colnames(counts) <- categories

for (i in seq_along(ids)) {
  for (j in seq_along(categories)) {
    count <- dt %>%
      filter(ID == ids[i], category == categories[j]) %>%
      nrow()

    counts[i, j] <- count
  }
}

Then:

>counts
##         red blue green
##person1   2    1     0
##person2   2    1     2
Julian Zucker
  • 564
  • 4
  • 13