3

I'm trying to get an idea of the availability of my data which might look like:

DT <- data.table(id=rep(c("a","b"),each=20),time=rep(1991:2010,2),
             x=rbeta(40,shape1=1,shape2=2),
             y=rnorm(40))
#I have some NA's (no gaps):
DT[id=="a"&time<2000,x:=NA]
DT[id=="b"&time>2005,y:=NA]

but is much larger of course. Ideally, I'd like to see a table like this:

       a         b
x 2000-2010  1991-2010
y 1991-2010  1991-2005

so the non-missing minimum to the non-missing maximun time period. I can get that for one variable:

DT[,availability_x:=paste0(
        as.character(min(ifelse(!is.na(x),time,NA),na.rm=T)),
        "-",
        as.character(max(ifelse(!is.na(x),time,NA),na.rm=T))),
    by=id]

But in reality, I want to do that for many variables. All my attempts to do that fail, however, because I'm having a hard time communicating a vector of columns to the data table. My guess is that it goes in the direction of this or this but my attempts to adapt these solutions to a vector of columns failed.

An apply function for example doesn't seem to evaluate the elements of a character vector:

cols <- c("x","y")

availabilityfunction <- function(i){
  DT[,paste0("avail_",i):=paste0(
       as.character(min(ifelse(!is.na(i),time,NA),na.rm=T)),
       "-",
       as.character(max(ifelse(!is.na(i),time,NA),na.rm=T))),
    by=id]}
lapply(cols,availabilityfunction)
Community
  • 1
  • 1
Jakob
  • 1,325
  • 15
  • 31
  • 2
    An alternative: `melt(DT, id = c("id", "time"))[!is.na(value), as.list(summary(time)), by=.(id, variable)]` or `range` in place of `summary`. – Frank Apr 18 '17 at 16:21
  • that's also nice and helps my understanding. thanks! – Jakob Apr 18 '17 at 16:48

1 Answers1

3

We can loop (lapply) through the columns of interest specified in .SDcols after grouping by 'id', create a logical index of non-NA elements (!is.na), find the numeric index (which), get the range (i.e. min and max), use that to subset the 'time' column and paste the time elements together.

DT[, lapply(.SD, function(x) paste(time[range(which(!is.na(x)))], 
                      collapse="-")), by = id, .SDcols = x:y]
#   id         x         y
#1:  a 2000-2010 1991-2010
#2:  b 1991-2010 1991-2005
akrun
  • 874,273
  • 37
  • 540
  • 662
  • that looks really good. But I get the following error: `Error in time[range(which(!is.na(x)))] : object of type 'closure' is not subsettable` which I usually get when I look for something that doesn't exist. Maybe its the double naming of x in the function and the data.table? – Jakob Apr 18 '17 at 16:53
  • @PeterPan Is this error based on the example you showed? If that is the case, it is working fine for me. I am using `data.table_1.10.4` – akrun Apr 18 '17 at 16:58
  • Probably you don't have a column named `time` where you're testing the code, so it's finding the function of that name instead @PeterPan – Frank Apr 18 '17 at 17:01
  • @akrun yes but with `data.table 1.9.6` and an update error, hmphf... I'll get back to you when I fixed this. Thanks so far! – Jakob Apr 18 '17 at 17:02
  • @Frank: thanks for sticking around! I change time to year without an effect. Also change the variable name (in case it gets confused with function(x) because the variable is called x) – Jakob Apr 18 '17 at 17:05
  • @PeterPan That could explain the error as we are also using a variable 'time' that is not included in `.SDcols` (which I think is a recent update in one of the versions) – akrun Apr 18 '17 at 17:09
  • 2
    @akrun Yes! Thats it. adding time to an object like `cols <- c("x","y","time")` and changing your last argument to `.SDcols = cols` solves the problem. Thanks! – Jakob Apr 18 '17 at 17:14