1

This isn't a dupe of this. That question deals with rows which already have NAs in them, my question deals with missing rows for which there should be a data point of 0.

Let's say I have this data.table

dt<-data.table(id=c(1,2,4,5,6,1,3,4,5,6),
           varname=c(rep('banana',5),rep('apple',5)),
            thedata=runif(10,1,10))

What's the best way to add, for each varname, the missing ids with a 0 for thedata?

At the moment I dcast with fill=0 and then melt again but this doesn't seem very efficient.

melt(dcast.data.table(dt,id~varname,value.var='thedata',fill=0),id.var='id',variable.factor=FALSE,variable.name='varname',value.name='thedata')

I also just thought of doing it this way but it gets a little clunky to fill in NAs at the end

merge(dt[,CJ(id=unique(id),varname=unique(varname))],dt,by=c('varname','id'),all=TRUE)[,.(varname,id,thedata=ifelse(!is.na(thedata),thedata,0))]

In this example, I only used one id column but any additional suggestion should be extensible to having more than one id column.

EDIT

I did a system.time on each approach with a largish data set and the melt/cast approach took between 2-3 seconds while the merge/CJ approach took between 12-13.

EDIT2

Roland's CJ approach is much better than mine as it only took between 4-5 seconds with my dataset.

Is there a better way to do this?

Community
  • 1
  • 1
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
  • if you really need a cartesian product of two vectors (all combinations of two vectors) then I would expect the `CJ` to be the most **scalable** solution - which may not be fastest on your data, but if you would have an `Inf` rows then it should be fastest :) – jangorecki Mar 11 '16 at 20:37
  • Show us your code to replicate the benchmark please. `dcast` internally uses `CJ`... – Arun Mar 12 '16 at 07:38

1 Answers1

4
setkey(dt, varname, id)
dt[CJ(unique(varname), unique(id))]
#    id varname  thedata
# 1:  1   apple 9.083738
# 2:  2   apple       NA
# 3:  3   apple 7.332652
# 4:  4   apple 3.610315
# 5:  5   apple 7.113414
# 6:  6   apple 9.046398
# 7:  1  banana 3.973751
# 8:  2  banana 9.907012
# 9:  3  banana       NA
#10:  4  banana 9.308346
#11:  5  banana 1.572314
#12:  6  banana 7.753611

Then substitute NA with 0 if you must (usually not appropriate).

Jaap
  • 81,064
  • 34
  • 182
  • 193
Roland
  • 127,288
  • 10
  • 191
  • 288
  • I looked closer and realized your way is different from my CJ. It is however still not as fast as melt/dcasting. – Dean MacGregor Mar 11 '16 at 15:32
  • 1
    You should share code that creates larger data for benchmarking. – Roland Mar 11 '16 at 15:34
  • Honestly, the time required isn't that important because it isn't a transformation I need to do often enough for 2sec vs 4sec to ever matter. I'll leave the question open just in case there's some other novel idea that someone comes up with otherwise I'll accept your answer after a little while. – Dean MacGregor Mar 11 '16 at 16:04
  • 3
    On the latest cran version of the package, you can do `CJ(varname, id, unique=TRUE)` – Frank Mar 11 '16 at 18:04