2

Data

I have a "tall" dataset with with repeated observations through time at the unit level, but the intervals are irregular. There are 4e6 observations, so I'd like to keep the solution in data.table.

library(data.table)

d <- data.frame(let = rep(LETTERS[1:10], times = 2),
                num = rep(c(2001,2002), each = 10),
                vals = 1:20,
                lagNum = c(rep(c(NA, 2002), times = 5), rep(NA, 10)) 
                ) # there are missing values in the object 

d[c(1:5, 16:20),]

##    let  num vals lagNum
## 1    A 2001    1     NA
## 2    B 2001    2   2002
## 3    C 2001    3     NA
## 4    D 2001    4   2002
## 5    E 2001    5     NA
## 16   F 2002   16     NA
## 17   G 2002   17     NA
## 18   H 2002   18     NA
## 19   I 2002   19     NA
## 20   J 2002   20     NA

If it helps to anchor the task in the real-world, the units are countries and the repeated observations are years in which the countries hold elections.

Goal

I would like to pass unit-values from vals_{unit, past} to vals_{unit, present}. In a sense, this feels like creating a wider dataset that :shuffles: the tall observations that reoccur in the present into a wider data.table.

I already know this isn't a case for zoo or other time series applications; I figure this isn't any version of a reshape call either, but rather is a merge/join. I also know that this response and this response are moving in the right direction, but aren't getting there.

Running this as a data.frame is pretty trivial:

dMerge <- merge(x = d[!(is.na(d$lagNum)), ],
                y = d[, c("let", "num", "vals")],
                by.x = c("let", "lagNum"),
                by.y = c("let", "num")
                )
dMerge
##   let lagNum  num vals.x vals.y
## 1   B   2002 2001      2     12
## 2   D   2002 2001      4     14
## 3   F   2002 2001      6     16
## 4   H   2002 2001      8     18
## 5   J   2002 2001     10     20

As data.table

My first thought was to duplicate the data.table and assign key values appropriately:

dat <- data.table(let = rep(LETTERS[1:10], times = 2),
                 num = rep(c(2001,2002), each = 10),
                 vals = 1:20,
                 lagNum = c(rep(c(NA, 2002), times = 5), rep(NA, 10))
                 )
setkeyv(dat, cols = c("let", "num"))

dat2 <- dat
setkeyv(dat2, cols = c("let", "lagNum"))

mDat <- merge(dat, dat2)

...and if this had worked I probably would have run with it. But it didn't, and it doesn't seem particularly 'data.table' to be wantonly duplicating data.

Community
  • 1
  • 1
alex
  • 345
  • 2
  • 10

1 Answers1

3

Try

library(data.table)#data.table_1.9.5
dat2 <- copy(dat)
setkeyv(dat2, cols = c("let", "lagNum"))
dat2[dat, nomatch =0][, i.lagNum:= NULL][]
#     let  num vals lagNum i.vals
#1:   B 2001    2   2002     12
#2:   D 2001    4   2002     14
#3:   F 2001    6   2002     16
#4:   H 2001    8   2002     18
#5:   J 2001   10   2002     20

Or you can subset the original dataset and then do the join after setting the keys.

dat2 <- dat[!is.na(lagNum)]
setkey(dat2, let, lagNum)
setkey(dat, let, num)
dat2[dat, nomatch=0][, i.lagNum := NULL][]
#   let  num vals lagNum i.vals
#1:   B 2001    2   2002     12
#2:   D 2001    4   2002     14
#3:   F 2001    6   2002     16
#4:   H 2001    8   2002     18
#5:   J 2001   10   2002     20
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I'm returning a warning: Warning message: `In [.data.table(dat2[dat, nomatch = 0], , :=(i.lagNum, NULL)) : Adding new column 'i.lagNum' then assigning NULL (deleting it).` – alex Mar 18 '15 at 03:58
  • @alex I am using `data.table_1.9.5` the devel version. Which version do you have? I couldn't reproduce any warning with the provided data – akrun Mar 18 '15 at 04:01
  • cran binary 1.9.2; let me update quickly. updated to 1.9.4, no change in the error status though. – alex Mar 18 '15 at 04:02
  • @alex Please check if you get the same warning after installing the devel version. You can install it from https://github.com/Rdatatable/data.table/wiki/Installation – akrun Mar 18 '15 at 04:04