How to group by and get count by two variables, and 'spread' the second variable, all in data.table

Question

I have data for several years for different locations. I want to get a count by year, by location:

library(data.table)
t1 <- data.table(ID = 1:100, Year = 2010:2015, Loc = LETTERS[1:7])
t1 <- t1[, .N, by=.(Year, Loc)]

#    Year Type N
# 1: 2010    A 3
# 2: 2011    B 3
# 3: 2012    C 3
# ...

However, what I want is a data.table solution similar to the one below using dplyr:

library(dplyr)
spread(t1, Loc, N)

#    Year A B C D E F G
# 1: 2010 3 2 2 2 2 3 3
# ...

How can I do this using data.table? I have millions of rows across ~100 variables many of which have thousands of factor levels, and I'd prefer to stick to data.table. I tried a number of lapply solutions to other group-by-two-variable questions, but they all threw the error .N is not a function? Am I missing something really obvious?

MKR · Accepted Answer · 2018-05-28T17:59:23.223

One can use data.table::dcast to spread the data. In the example by used by OP, one can add dcast as:

dcast(t1, Year ~ Loc, value.var = "N")

#    Year A B C D E F G
# 1: 2010 3 2 2 2 2 3 3
# 2: 2011 3 3 2 2 2 2 3
# 3: 2012 3 3 3 2 2 2 2
# 4: 2013 2 3 3 3 2 2 2
# 5: 2014 2 2 2 3 3 2 2
# 6: 2015 2 2 2 2 3 3 2

Even the 2nd intermediate step as shown by OP is not needed. One can use fun.aggregate = length to dcast to get the count for each location.

t1 <- data.table(ID = 1:100, Year = 2010:2015, Loc = LETTERS[1:7])

# Now use 'fun.aggregate' = length
dcast(t1, Year ~ Loc, fun.aggregate = length, value.var = "Loc")
#    Year A B C D E F G
# 1: 2010 3 2 2 2 2 3 3
# 2: 2011 3 3 2 2 2 2 3
# 3: 2012 3 3 3 2 2 2 2
# 4: 2013 2 3 3 3 2 2 2
# 5: 2014 2 2 2 3 3 2 2
# 6: 2015 2 2 2 2 3 3 2

Your answer worked perfectly, thank you! I wish I had asked the question much earlier in the process.... and stepped outside the bounds of the data.table cheat sheet.... instead of banging my head on the wall. — armipunk, May 28 '18 at 18:57

How to group by and get count by two variables, and 'spread' the second variable, all in data.table

1 Answers1