Using 'match' with data.table

Question

In the data.table below, individuals are named in p1. Each individual has an income given by inc_1, generated as follows:

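library(data.table)
data_gen = function(){
  p_names = letters[1:10]
  # 10 shuffled names plus one extra person ("y" in p1, "z" in p2) with no match
  dataset = data.table(p1 = c(sample(p_names, 10, replace = FALSE), "y"),
                       p2 = c(sample(p_names, 10, replace = FALSE), "z"),
                       inc_1 = round(rnorm(11, 1000, 200)))
  return(dataset)
}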

set.seed(43210)
data_1 = data_gen()
data_1

Each individual in p1 is closely related to the individual listed in p2, and I would like the income of p2 listed in a new column inc_2, just to the right of inc_1. The "match" command is useful for achieving this aim:

data_2 = copy(data_1) # saved for later use; copy() keeps it independent of data_1
data_1$inc_2 = data_1$inc_1[match(data_1$p2,data_1$p1,nomatch = NA)]
data_1
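To make the lookup explicit, here is a minimal sketch of what `match` does in the line above. The vectors are hypothetical toy values, not the data generated by data_gen:

```r
# Toy vectors (hypothetical values, not the question's data)
p1  <- c("a", "b", "c")
p2  <- c("c", "a", "z")
inc <- c(100, 200, 300)

# For each element of p2, match() gives its position in p1 (NA if absent)
pos <- match(p2, p1)
pos        # 3 1 NA

# Indexing the income vector by those positions looks up the partner's income
inc[pos]   # 300 100 NA
```

The person "z" does not appear in p1, so the lookup returns NA, which is what happens for p2="z" in data_1.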

In data_1, we see the income inc_2 of p2="i" listed just to the right of inc_1 of p1="b", and so on. However, once a new dimension, the year, is added to the dataset, I am not able to generate the partner's (p2) income inc_2 correctly across years.

set.seed(43211)
data_3 = data_gen()
data_4 = rbind(cbind(year=rep(2015,11),data_2),cbind(year=rep(2016,11),data_3))
data_4

If we reuse the same code as before, 'match' ignores the time dimension, because it returns the position of the first match only. For 2016 and p1="g", it does not return the 2016 income of p2="h" as inc_2, but the 2015 income of "h" instead:

data_4$inc_2 = data_4$inc_1[match(data_4$p2,data_4$p1,nomatch = NA)]
data_4
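The first-match behaviour can be reproduced on a small scale. This is only a sketch with hypothetical values. The composite key built with `paste` is one possible base-R workaround that I am assuming here, not something taken from the code above:

```r
# Two people observed in two years (hypothetical toy values)
year <- c(2015, 2015, 2016, 2016)
p1   <- c("g", "h", "g", "h")
p2   <- c("h", "g", "h", "g")
inc  <- c(10, 20, 30, 40)

# match() stops at the first hit, so the 2016 rows point back to 2015 rows
pos <- match(p2, p1)
pos        # 2 1 2 1
inc[pos]   # 20 10 20 10

# Making the year part of the key restricts the lookup to the same year
pos_year <- match(paste(year, p2), paste(year, p1))
pos_year       # 2 1 4 3
inc[pos_year]  # 20 10 40 30
```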

I thought that adding by=c('year') would solve the problem, but none of the lines below generates inc_2 properly:

data_4[ , inc_1[match(p2,p1,nomatch = NA)],by=c('year')] # close, but the result is not included in data_4
data_4[ , inc_2 = inc_1[match(p2,p1,nomatch = NA)],by=c('year')]
data_4$inc_2 = data_4[ , inc_1[match(p2,p1,nomatch = NA)],by=c('year')]
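For comparison, here is a sketch of how the grouped lookup can be written with data.table's `:=` operator, which adds a column by reference. The first attempt above only returns a new table with the grouped result, and a plain `=` inside `[` is not how data.table creates columns. The table `dt` and its values are hypothetical:

```r
library(data.table)

# Hypothetical toy panel: two people observed in two years
dt <- data.table(
  year  = c(2015, 2015, 2016, 2016),
  p1    = c("g", "h", "g", "h"),
  p2    = c("h", "g", "h", "g"),
  inc_1 = c(10, 20, 30, 40)
)

# `:=` adds inc_2 to dt in place; `by = year` makes match() see
# only the rows of the current year
dt[, inc_2 := inc_1[match(p2, p1)], by = year]
dt$inc_2   # 20 10 40 30
```

With this form the 2016 rows pick up 2016 incomes, and dt keeps all of its original columns.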

I would appreciate any comment on this point...

It is easier for others to help you when you include some [reproducible example data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). — Jaap, Feb 09 '18 at 14:23
Sorry, I forgot to insert the first few lines. Here they are: `data = function(){ p_names = letters[1:10]; dataset = data.table(p1 = c(sample(p_names,10,replace=F),"y"), p2 = c(sample(p_names,10,replace=F),"z"), inc_1 = round(rnorm(11,1000,200))); return(dataset) }` — Bertrand, Feb 09 '18 at 14:54
I think it is better to name your function differently. R already has a function called `data`. See `?data`. — Jaap, Feb 09 '18 at 15:00
The "match" command is useful for achieving this aim -- Is it? An idiomatic approach to your first stated problem is `data_1[, inc_2 := .SD[.SD, on=.(p1 = p2), x.inc_1]]`, not match, I guess. You might want to read through the data.table vignettes. — Frank, Feb 09 '18 at 15:27
Thank you Frank, your suggestion is working fine, also with several years included: `data_4[, inc_2 := .SD[.SD, on=.(p1 = p2), x.inc_1], by='year']` — Bertrand, Feb 09 '18 at 19:54
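The join suggested in the comments can be tried on a small table. This is a sketch under assumptions: the table `dt` and its values are hypothetical, with one extra person ("y", partner "z") who has no match, as in the generated data:

```r
library(data.table)

# Hypothetical toy panel; "z" never appears in p1, so it has no income to look up
dt <- data.table(
  year  = c(2015, 2015, 2016, 2016, 2016),
  p1    = c("g", "h", "g", "h", "y"),
  p2    = c("h", "g", "h", "g", "z"),
  inc_1 = c(10, 20, 30, 40, 50)
)

# Within each year, join the group's rows (.SD) to themselves:
# for every row, find the row whose p1 equals this row's p2 and
# take that row's inc_1 (the x. prefix refers to the table being looked up)
dt[, inc_2 := .SD[.SD, on = .(p1 = p2), x.inc_1], by = "year"]
dt$inc_2   # 20 10 40 30 NA
```

On this toy table the result is the same as a per-year `match`, with NA for the unmatched partner.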
