Merge two data frame where one column is matched based on condition

Question

Mock data:

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

df1 contains the country-year of interest. I want to get the myvar values for this country-year AND 3 years before and after.

In other words, the merging is done based on condition that df2$country==df1$country AND df2$year > df1$year - 3 & df2$year < df1$year + 3

EDIT: My (working, not elegant) solution is to pad df1 to create all the country-years that I'm interested in, then merge with df2 the regular way.

library(plyr)
ddply(df1, c("country", "year"), 
  function(df) data.frame(rep(df$country, 7), (df$year-3):(df$year+3)))

produces

   country year rep.df.country..7. X.df.year...3...df.year...3.
1       UK 2003                 UK                         2000
2       UK 2003                 UK                         2001
3       UK 2003                 UK                         2002
4       UK 2003                 UK                         2003
5       UK 2003                 UK                         2004
6       UK 2003                 UK                         2005
7       UK 2003                 UK                         2006
8       US 2000                 US                         1997
9       US 2000                 US                         1998
10      US 2000                 US                         1999
11      US 2000                 US                         2000
12      US 2000                 US                         2001
13      US 2000                 US                         2002
14      US 2000                 US                         2003

Don't have time for a full answer now, but will post one later if I manage. Check out the new function `foverlaps` from the package `data.table` if you use them. Used to answer [this question](http://stackoverflow.com/questions/26383944/speed-improvement-for-sapply-along-a-posix-sequence) — ilir, Oct 19 '14 at 21:01

rawr · Answer 1 · 2014-10-20T00:19:36.123

Where does merging fit in? This just sounds like a subsetting issue unless I misunderstood the question (as I oft admittedly do)

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))


f <- lapply(df1$country, function(x) {
  tmp <- df2[df2$country == x, ]
  tmp[abs(tmp$year - df1[df1$country == x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361

EDIT

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

f <- lapply(seq_len(nrow(df1)), function(x) {
  y <- df1[x, 'country']
  tmp <- df2[df2$country == y, ]
  tmp[abs(tmp$year - df1[x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361
# 7       US 2006  0.48742905
# 9       US 2008  0.57578135
# 17      US 2006 -0.01619026
# 19      US 2008  0.82122120
# 8       UK 2007  0.73832471
# 10      UK 2009 -0.30538839
# 18      UK 2007  0.94383621
# 20      UK 2009  0.59390132

This works for the mock data, but not in a more general setting where `df1$country` is not unique (because of this part `df1[df1$country == x, 'year']`). Note that my working solution works in that general condition (I think). (It's my bad for posting oversimplified mock data ofc). — Heisenberg, Oct 19 '14 at 23:06
are there different years for each non unique country? if not, you can just use `unique(df1[...)` — rawr, Oct 20 '14 at 00:02

KFB · Answer 2 · 2014-10-20T12:09:27.083

1

A trial using foverlaps in data.table

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))
library(data.table)
setDT(df1); setDT(df2) # convert to data table
df1[, c("start", "end") := list(year-2, year+2)]
setkey(df1, country, start, end)
setkey(df2[, year2:=year], country, year, year2)
foverlaps(df1, df2, type="any")[,4:7:=NULL][]
    country year       myvar
 1:      UK 2001  0.18364332
 2:      UK 2001  0.38984324
 3:      UK 2003  1.59528080
 4:      UK 2003 -2.21469989
 5:      UK 2005 -0.82046838
 6:      UK 2005 -0.04493361
 7:      UK 2007  0.73832471
 8:      UK 2007  0.94383621
 9:      UK 2009 -0.30538839
10:      UK 2009  0.59390132
11:      US 2000 -0.62645381
12:      US 2000  1.51178117
13:      US 2002 -0.83562861
14:      US 2002 -0.62124058
15:      US 2008  0.57578135
16:      US 2008  0.82122120

edited Oct 20 '14 at 12:09

answered Oct 19 '14 at 22:07

KFB

3,501
3
15
18

I think it must be `year-2` and `year+2` (since `foverlaps` works on closed intervals)? Now I see the reason for your earlier Q about `foverlaps`. Will fix. Thanks. – Arun Oct 20 '14 at 07:50
Small note: If keys are set for both tables appropriately, then `by.x` and `by.y` are not necessary. – Arun Oct 20 '14 at 07:51
Removed by.x by.y.; y-3 and year+3 should be correct to include concerned year. – KFB Oct 20 '14 at 08:47
Again, I don't think so. The condition is `df1$year - 3 < df2$year < df1$year+3`; i.e., `(df1$year-3, df1$year+3)`. But `foverlaps` assumes closed intervals - `[df1$year-3, df1$year+3]` and would therefore also match on the border. I don't think that's required. Also IIUC the `foverlaps` command must be `foverlaps(df2, df1, type="within", nomatch=0L)` - check if `df2` interval lies within df1's interval, `df1` is the *lookup*. – Arun Oct 20 '14 at 10:57
Right. There is no = sign. – KFB Oct 20 '14 at 12:07
1

geez that's ugly. I thought data table was supposed to be elegant :) @Arun – rawr Oct 21 '14 at 14:12
@rawr @Arun I've added another `data.table` solution, perhaps more elegant ? – rafa.pereira May 11 '16 at 23:09

score 0 · Answer 3 · edited May 12 '16 at 12:19

A perhaps simple solution using data.table

library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation    

# convert datasets into data.table
  setDT(df1)
  setDT(df2)


# create conditional columns in df1
  df1[, yearplus3  :=  year +3 ][, yearminus3 := year - 3 ]

# merge
    output <- df1[df2, on = .(country = country ,                # condition 1
                              yearminus3 < year,                 # condition 2
                              yearplus3  > year), nomatch = 0 ,  # condition 3
                  .(country, year,  myvar )]  # indicate columns in the output


output 
 >   country year       myvar
 >1:      US 2000 -0.62645381
 >2:      UK 2003  0.18364332
 >3:      US 2000 -0.83562861
 >4:      UK 2003  1.59528080
 >5:      UK 2003 -0.82046838
 >6:      US 2000  1.51178117
 >7:      UK 2003  0.38984324
 >8:      US 2000 -0.62124058

ps. note that the argument on = is still in development version of data.table as of today (12 May 2016)

You should probably mention that this is currently in development and is subjected to changes.. (non-equi joins part). — Arun, May 12 '16 at 07:44

Merge two data frame where one column is matched based on condition

3 Answers3

Linked