4

Mock data:

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

df1 contains the country-year of interest. I want to get the myvar values for this country-year AND 3 years before and after.

In other words, the merging is done based on condition that df2$country==df1$country AND df2$year > df1$year - 3 & df2$year < df1$year + 3

EDIT: My (working, not elegant) solution is to pad df1 to create all the country-years that I'm interested in, then merge with df2 the regular way.

library(plyr)
ddply(df1, c("country", "year"), 
  function(df) data.frame(rep(df$country, 7), (df$year-3):(df$year+3)))

produces

   country year rep.df.country..7. X.df.year...3...df.year...3.
1       UK 2003                 UK                         2000
2       UK 2003                 UK                         2001
3       UK 2003                 UK                         2002
4       UK 2003                 UK                         2003
5       UK 2003                 UK                         2004
6       UK 2003                 UK                         2005
7       UK 2003                 UK                         2006
8       US 2000                 US                         1997
9       US 2000                 US                         1998
10      US 2000                 US                         1999
11      US 2000                 US                         2000
12      US 2000                 US                         2001
13      US 2000                 US                         2002
14      US 2000                 US                         2003
3442
  • 8,248
  • 2
  • 19
  • 41
Heisenberg
  • 8,386
  • 12
  • 53
  • 102
  • 1
    Don't have time for a full answer now, but will post one later if I manage. Check out the new function `foverlaps` from the package `data.table` if you use them. Used to answer [this question](http://stackoverflow.com/questions/26383944/speed-improvement-for-sapply-along-a-posix-sequence) – ilir Oct 19 '14 at 21:01

3 Answers3

1

Where does merging fit in? This just sounds like a subsetting issue unless I misunderstood the question (as I oft admittedly do)

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))


f <- lapply(df1$country, function(x) {
  tmp <- df2[df2$country == x, ]
  tmp[abs(tmp$year - df1[df1$country == x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361

EDIT

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

f <- lapply(seq_len(nrow(df1)), function(x) {
  y <- df1[x, 'country']
  tmp <- df2[df2$country == y, ]
  tmp[abs(tmp$year - df1[x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361
# 7       US 2006  0.48742905
# 9       US 2008  0.57578135
# 17      US 2006 -0.01619026
# 19      US 2008  0.82122120
# 8       UK 2007  0.73832471
# 10      UK 2009 -0.30538839
# 18      UK 2007  0.94383621
# 20      UK 2009  0.59390132
rawr
  • 20,481
  • 4
  • 44
  • 78
  • This works for the mock data, but not in a more general setting where `df1$country` is not unique (because of this part `df1[df1$country == x, 'year']`). Note that my working solution works in that general condition (I think). (It's my bad for posting oversimplified mock data ofc). – Heisenberg Oct 19 '14 at 23:06
  • are there different years for each non unique country? if not, you can just use `unique(df1[...)` – rawr Oct 20 '14 at 00:02
  • Yep, different years for non unique country is the concern – Heisenberg Oct 20 '14 at 00:04
1

A trial using foverlaps in data.table

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))
library(data.table)
setDT(df1); setDT(df2) # convert to data table
df1[, c("start", "end") := list(year-2, year+2)]
setkey(df1, country, start, end)
setkey(df2[, year2:=year], country, year, year2)
foverlaps(df1, df2, type="any")[,4:7:=NULL][]
    country year       myvar
 1:      UK 2001  0.18364332
 2:      UK 2001  0.38984324
 3:      UK 2003  1.59528080
 4:      UK 2003 -2.21469989
 5:      UK 2005 -0.82046838
 6:      UK 2005 -0.04493361
 7:      UK 2007  0.73832471
 8:      UK 2007  0.94383621
 9:      UK 2009 -0.30538839
10:      UK 2009  0.59390132
11:      US 2000 -0.62645381
12:      US 2000  1.51178117
13:      US 2002 -0.83562861
14:      US 2002 -0.62124058
15:      US 2008  0.57578135
16:      US 2008  0.82122120
KFB
  • 3,501
  • 3
  • 15
  • 18
  • I think it must be `year-2` and `year+2` (since `foverlaps` works on closed intervals)? Now I see the reason for your earlier Q about `foverlaps`. Will fix. Thanks. – Arun Oct 20 '14 at 07:50
  • Small note: If keys are set for both tables appropriately, then `by.x` and `by.y` are not necessary. – Arun Oct 20 '14 at 07:51
  • Removed by.x by.y.; y-3 and year+3 should be correct to include concerned year. – KFB Oct 20 '14 at 08:47
  • Again, I don't think so. The condition is `df1$year - 3 < df2$year < df1$year+3`; i.e., `(df1$year-3, df1$year+3)`. But `foverlaps` assumes closed intervals - `[df1$year-3, df1$year+3]` and would therefore also match on the border. I don't think that's required. Also IIUC the `foverlaps` command must be `foverlaps(df2, df1, type="within", nomatch=0L)` - check if `df2` interval lies within df1's interval, `df1` is the *lookup*. – Arun Oct 20 '14 at 10:57
  • Right. There is no = sign. – KFB Oct 20 '14 at 12:07
  • 1
    geez that's ugly. I thought data table was supposed to be elegant :) @Arun – rawr Oct 21 '14 at 14:12
  • @rawr @Arun I've added another `data.table` solution, perhaps more elegant ? – rafa.pereira May 11 '16 at 23:09
0

A perhaps simple solution using data.table

library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation    

# convert datasets into data.table
  setDT(df1)
  setDT(df2)


# create conditional columns in df1
  df1[, yearplus3  :=  year +3 ][, yearminus3 := year - 3 ]

# merge
    output <- df1[df2, on = .(country = country ,                # condition 1
                              yearminus3 < year,                 # condition 2
                              yearplus3  > year), nomatch = 0 ,  # condition 3
                  .(country, year,  myvar )]  # indicate columns in the output


output 
 >   country year       myvar
 >1:      US 2000 -0.62645381
 >2:      UK 2003  0.18364332
 >3:      US 2000 -0.83562861
 >4:      UK 2003  1.59528080
 >5:      UK 2003 -0.82046838
 >6:      US 2000  1.51178117
 >7:      UK 2003  0.38984324
 >8:      US 2000 -0.62124058

ps. note that the argument on = is still in development version of data.table as of today (12 May 2016)

Arun
  • 116,683
  • 26
  • 284
  • 387
rafa.pereira
  • 13,251
  • 6
  • 71
  • 109
  • 1
    You should probably mention that this is currently in development and is subjected to changes.. (non-equi joins part). – Arun May 12 '16 at 07:44