compressing similar columns when converting a data.table from long to wide

Question

I am working with a longitudinal data set with repeated observations for subjects in a long format data.table. Most subjects have a few (<10) repeated observations, while a few subjects have many (>100) observations. I can convert this data set from long to wide as below, but it becomes extremely wide (I have many variables at each time point) and is mostly full of NAs, since most subjects don't have data for variables at times 11 through 100. Is there a more elegant way to recast this data to wide format? I'm thinking something along the lines of a ragged array in other languages...

Some solutions exist here, but a big concern for me is object size: the wide matrix with lots of NAs takes up a lot of unnecessary space.

A MWE with my current (undesirably sparse matrix) solution is below. Ideally, if some sort of ragged list approach is feasible, the resulting object would have 3 rows and 3 columns, where the "year" and "code" columns are lists or similar. As a bonus, it would be wonderful if I could nest the "code" variable inside the "year" variable as nested ragged arrays.

library(data.table)

dat <- data.table(id=c(rep(1,5), rep(2,10), rep(3,85)),
    year=sample(2013:2016, 100, replace=TRUE),
    code=sample(LETTERS, 100, replace=TRUE))

wideDat <- dcast(dat, id~paste0("code", dat[,seq_len(.N), by=id]$V1), 
    value.var="code")

Could you also provide an object that looks like your desired output? I'm guessing it's unlike wideDat — Frank, Jan 19 '17 at 02:25
@Frank That's actually part of my question: I've only recently taken up using data.table, and have on many occasions found wonderful surprises about flexibly defining data objects. Hoping someone might enlighten me similarly here. In a pseudo-object: column `id=c(1,2,3)`; column `year` is a list of length 3, each item with only as many entries as there are longitudinal observations (not storing hundreds of NAs); column `code` is a list of length 3 defined analogously to `year`. Not necessarily stuck on lists (and not sure if feasible in the context of data.table), it's just what comes to mind. — Travis Gerke, Jan 19 '17 at 03:06
Ok. You can do `dat[, lapply(.SD, list), by=id]`, but that's not a very useful format for analysis or printing or anything else I can think of. — Frank, Jan 19 '17 at 03:24

score 2 · Accepted Answer · answered Jan 19 '17 at 05:37

A few ideas

object.size(wideDat)
# 22432 bytes

# the following structures leverages the fact that years are missing 
wideDat2 <- dcast(dat, id+year~code)   
#   id year A B C D E F G I J K L M N O P Q R S T U V W X Y Z
#1:  1 2014 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
#2:  1 2015 0 0 1 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#3:  2 2013 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#4:  2 2014 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
#5:  2 2015 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#6:  2 2016 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
#7:  3 2013 1 0 0 1 0 1 1 2 2 0 1 1 2 1 1 2 3 0 1 0 0 2 0 0 1
#8:  3 2014 1 2 0 0 2 1 0 3 0 0 3 0 0 0 3 1 0 2 1 1 0 2 0 0 2
#9:  3 2015 0 2 1 0 0 0 0 2 2 1 1 0 0 0 1 0 3 1 2 1 2 1 1 0 0
#10: 3 2016 1 0 0 2 0 1 0 0 0 1 0 2 1 2 1 1 1 0 1 0 1 1 0 1 0

object.size(wideDat2)
# 6872 bytes

## the following struture just compresses the codes as strings
library(dplyr)
wideDat3 <- dat %>% 
  group_by(id, year) %>% 
  arrange(id, year, code) %>%
  summarize(codes = paste0(code, collapse=","))
#      id  year                                           codes
<#   dbl> <int>                                           <chr>
#1      1  2014                                               P
#2      1  2015                                         C,J,L,L
#3      2  2013                                               B
#4      2  2014                                             S,W
#5      2  2015                                             A,A
#6      2  2016                                       B,G,K,O,S
#7      3  2013   A,D,F,G,I,I,J,J,L,M,N,N,O,P,Q,Q,R,R,R,T,W,W,Z
#8      3  2014 A,B,B,E,E,F,I,I,I,L,L,L,P,P,P,Q,S,S,T,U,W,W,Z,Z
#9      3  2015       B,B,C,I,I,J,J,K,L,P,R,R,R,S,T,T,U,V,V,W,X
#10     3  2016               A,D,D,F,K,M,M,N,O,O,P,Q,R,T,V,W,Y
object.size(wideDat3)
# 2856 bytes

## .. or as nested list 
wideDat4 <- dat %>% 
  group_by(id, year) %>% 
  arrange(id, year, code) %>%
  summarize(codes = list(code))
#   id  year      codes
#<dbl> <int>     <list>
#  1      1  2014  <chr [1]>
#  2      1  2015  <chr [4]>
#  3      2  2013  <chr [1]>
#  4      2  2014  <chr [2]>
#  5      2  2015  <chr [2]>
#  6      2  2016  <chr [5]>
#  7      3  2013 <chr [23]>
#  8      3  2014 <chr [24]>
#  9      3  2015 <chr [21]>
#  10     3  2016 <chr [17]>

object.size(widedat4)
# 6776 bytes

So helpful, and thanks for landing on a key idea that I forgot to mention in my original question: object size. My actual data is much larger than the MWE, so I use a lot of unnecessary space with all the NAs in wideDat. Your wideDat4 got me where I wanted to go, with the modification `wideDat5 <- dat %>% group_by(id) %>% arrange(id, year, code) %>% summarize(years=list(year), codes = list(code))` delivering the "tibble" object (size 4960 bytes) which is exactly what I was seeking. — Travis Gerke, Jan 19 '17 at 14:05
@T.Gerke Fyi `wideDat6 = setnames(dat[order(id, year, code), lapply(.SD, list), by=id], c("id", "years", "codes"))` is another way to get there (using data.table). `all.equal(data.frame(wideDat5), data.frame(wideDat6)) # TRUE` with object size of `4904 bytes` — Frank, Jan 19 '17 at 15:20
@Frank That is _exactly_ what I was looking for: prints a little more cleanly than wideDat5, works within the confines of data.table (not that I'm terribly opposed to invoking dplyr, but does make things cleaner), and is slightly smaller in size than wideDat5 — Travis Gerke, Jan 19 '17 at 15:57

compressing similar columns when converting a data.table from long to wide

1 Answers1