3

I am looking for a quick extension to the following solution posted here. In it Frank shows that for an example data table

test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))

You can quickly make dummies by using the following code

inds <- unique(test$index) ; test[,(inds):=lapply(inds,function(x)index==x)]

Now I want to extend this solution for a data.table that has multiple rows of indices, e.g.

new <- data.table("id" = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"),125), "index1"=rep(letters[1:5],200),"index2" = rep(letters[6:15],100),"index3" = rep(letters[16:19],250))

I need to do this for many dummies and ideally the solution would allow me to get 4 things:

  1. The total count of every index
  2. The mean times every index occurs
  3. The count of every index per id
  4. The mean of every index per id

In my real case, the indices are named differently so the solution would need to be able to loop through the column names I think.

Thanks

Simon

Community
  • 1
  • 1
SJDS
  • 1,239
  • 1
  • 16
  • 31

1 Answers1

2

If you only need the four items in that list, you should just tabulate:

indcols <- paste0('index',1:3)
lapply(new[,indcols,with=FALSE],table) # counts
lapply(new[,indcols,with=FALSE],function(x)prop.table(table(x))) # means

# or...

lapply(
  new[,indcols,with=FALSE],
  function(x){
    z<-table(x)
    rbind(count=z,mean=prop.table(z))
  })

This gives

$index1
          a     b     c     d     e
count 200.0 200.0 200.0 200.0 200.0
mean    0.2   0.2   0.2   0.2   0.2

$index2
          f     g     h     i     j     k     l     m     n     o
count 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
mean    0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1

$index3
           p      q      r      s
count 250.00 250.00 250.00 250.00
mean    0.25   0.25   0.25   0.25


The previous approach would work on a data.frame or a data.table, but is rather complicated. With a data.table, one can use the melt syntax:

melt(new, id="id")[,.(
  N=.N, 
  mean=.N/nrow(new)
), by=.(variable,value)]

which gives

    variable value   N mean
 1:   index1     a 200 0.20
 2:   index1     b 200 0.20
 3:   index1     c 200 0.20
 4:   index1     d 200 0.20
 5:   index1     e 200 0.20
 6:   index2     f 100 0.10
 7:   index2     g 100 0.10
 8:   index2     h 100 0.10
 9:   index2     i 100 0.10
10:   index2     j 100 0.10
11:   index2     k 100 0.10
12:   index2     l 100 0.10
13:   index2     m 100 0.10
14:   index2     n 100 0.10
15:   index2     o 100 0.10
16:   index3     p 250 0.25
17:   index3     q 250 0.25
18:   index3     r 250 0.25
19:   index3     s 250 0.25

This approach was mentioned by @Arun in a comment (and implemented by him also, I think..?). To see how it works, first have a look at melt(new, id="id") which transforms the original data.table.

As mentioned in the comments, melting a data.table requires installing and loading reshape2 for some versions of the data.table package.



If you also need the dummies, they can be made in a loop as in the linked question:

newcols <- list()
for (i in indcols){
    vals = unique(new[[i]])
    newcols[[i]] = paste(vals,i,sep='_')
    new[,(newcols[[i]]):=lapply(vals,function(x)get(i)==x)]
}

This stores the groups of columns associated with each variable in newcols for convenience. If you wanted to do the tabulation just with these dummies (instead of the underlying variables as in solution above), you could do

lapply(
  indcols,
  function(i) new[,lapply(.SD,function(x){
    z <- sum(x)
    list(z,z/.N)
  }),.SDcols=newcols[[i]] ])

which gives a similar result. I just wrote it this way to illustrate how data.table syntax can be used. You could again avoid square brackets and .SD here:

lapply(
  indcols,
  function(i) sapply(
    new[, newcols[[i]], with=FALSE],
    function(x){
      z<-sum(x)
      rbind(z,z/length(x))
    }))

But anyway: just use table if you can hold onto the underlying variables.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • Wouldn't it be easier if we *melt* it? i.e., for the first two parts, `melt(new, id="id")[, .(N=.N, mean=.N/nrow(new)), by=.(variable,value)]`. Add `id` to the grouping variable for the last 2..? – Arun Feb 18 '15 at 17:36
  • @Arun: Perhaps. I'm not familiar with the `melt` syntax yet. I'd try it out, but `melt` and `melt.data.table` say "could not find function` (even though `help(melt.data.table)` works). I guess my package installation isn't updated for that yet...? Anyway, I think you would be better positioned to answer with that syntax. Feel free to edit or add another answer. I'm curious to see how it works. – Frank Feb 18 '15 at 17:40
  • 1
    Versions < 1.9.4 requires `reshape2` to be loaded. You don't need it from `1.9.5+`. – Arun Feb 18 '15 at 18:08
  • @Arun Edited. That's very convenient! Please feel free to make further edits. – Frank Feb 18 '15 at 18:22
  • 2
    Cool. `melt` and `cast` are functions from `reshape2` package. We've implemented corresponding data.table functions in C for speed. We imported reshape2 before, but now define generics in data.table (as melt is a S3 generic, but not dcast in reshape2 - dint seem like that'll change). – Arun Feb 18 '15 at 18:27