
I have a dataset (mydata) with multiple columns whose values should fall inside ranges stored in another dataset (mycomparison).

I would like to join mycomparison to mydata where the mydata values are within the ranges in mycomparison.

MWE

library(data.table)

mydata<-data.table(
  id=1:5,
  val1=seq(10000, 50000, by=10000),
  val2=floor(rnorm(5,mean=400,sd=100)),
  val3=rnorm(5,mean=.7,sd=.1)
)

mycomparison<-data.table(
  Name=LETTERS[1:3],
  minval1=c(0,30000,10000),
  maxval1=c(50000,80000,30000),
  minval2=c(300,400,300),
  maxval2=c(800,800,800),
  minval3=c(0,.5,.2),
  maxval3=c(1,.9,.8),
  correspondingval=c(.1,.2,.3)
)

Desired Output

> mydata.withmatches
   id  val1 val2      val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1:  1 10000  387 0.4844319    A       0   50000     300     800       0       1              0.1
2:  2 20000  425 0.7856313   NA      NA      NA      NA      NA      NA      NA               NA
3:  3 30000  324 0.8063969   NA      NA      NA      NA      NA      NA      NA               NA
4:  4 40000  263 0.5590113   NA      NA      NA      NA      NA      NA      NA               NA
5:  5 50000  187 0.8764396   NA      NA      NA      NA      NA      NA      NA               NA

Current solution

This feels (and is) very clunky: it cross-joins the data (using optiRum::CJ.dt), does a big logical check, and then reassembles the data.

library(optiRum)

workingdt <- CJ.dt(mydata, mycomparison)   # full cross join

# keep rows where every value falls inside its range, then take the single
# row with the smallest correspondingval
matched <- workingdt[val1 >= minval1 &
                       val1 <= maxval1 &
                       val2 >= minval2 &
                       val2 <= maxval2 &
                       val3 >= minval3 &
                       val3 <= maxval3][which.min(correspondingval)]
notmatched <- mydata[!id %in% matched[, id]]

all <- list(matched, notmatched)

mydata.withmatches <- rbindlist(all, fill = TRUE, use.names = TRUE)

Looking for a better solution - UPDATED

I'm aware of foverlaps, but it works on a single interval, not on many ranges as in this instance.
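For context, this is roughly how I'd expect a single-range foverlaps call to look (a sketch only; the duplicated point columns and the by.x mapping are my assumptions, following the usual point-in-range trick):

# Sketch: foverlaps on just the val1 range pair (minval1/maxval1).
# A point has to be expressed as a zero-width interval, hence the duplicated columns.
cmp1 <- copy(mycomparison)
setkey(cmp1, minval1, maxval1)          # last two key columns define the interval

pts <- copy(mydata)[, `:=`(v1start = val1, v1end = val1)]

foverlaps(pts, cmp1,
          by.x = c("v1start", "v1end"),
          type = "within")              # mydata rows falling inside [minval1, maxval1]

That covers val1 only; I can't see how to express the val2 and val3 ranges in the same call.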

I'm hoping for a less clunky and more elegant solution.

Steph Locke
  • [A previous `foverlaps` Q&A](http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window/25655497#25655497) seems to resemble your case: a single 'point' (you: e.g. `val1`, other Q: `pos`) should be joined with a range (you: e.g. `minval1`/`maxval1`, other Q: `start`/`end`). @arun writes: "Your question is a special case of such an overlap join, where the start and end coordinates are identical", (`val1-val1`; `pos-pos`). @arun's trick there was to create a second `pos` variable (`pos2 := pos`). This may be a first step forward for you as well. – Henrik Feb 16 '15 at 12:21
  • Cheers @Henrik - I'm not sure though that the creation of a lot of extra columns would be particularly elegant - I'll have a crack at a virtual column solution – Steph Locke Feb 16 '15 at 12:45
  • @Henrik on exploration it looks like foverlaps may not be viable when trying to use it over multiple ranges – Steph Locke Feb 16 '15 at 14:02
  • This is old, but aren't there more matches than shown? i.e. id 2 is within Name A as well. – ARobertson Feb 20 '15 at 00:19
  • It'll vary as I put rnorms in to allow for testing repeatedly to ensure no duplicate rows are returned – Steph Locke Feb 20 '15 at 07:46
  • Maybe `intersect(foverlaps(which=T, case1), foverlaps(which=T, case2), ...)`? Not very friendly but should scale for time and memory – jangorecki Jun 01 '20 at 12:13

2 Answers


I do not exactly understand your Desired Output, because multiple ids match the mycomparison data.table. Using your data (rounded to two decimal places):

> mydata
   id  val1 val2 val3
1:  1 10000  387 0.48
2:  2 20000  425 0.79
3:  3 30000  324 0.81
4:  4 40000  263 0.56
5:  5 50000  187 0.88

And

> mycomparison
   Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1:    A       0   50000     300     800     0.0     1.0              0.1
2:    B   30000   80000     400     800     0.5     0.9              0.2
3:    C   10000   30000     300     800     0.2     0.8              0.3

The cross join workingdt <- CJ.dt(mydata, mycomparison) from your question then gives:

> workingdt
    id  val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
 1:  1 10000  387 0.48    A       0   50000     300     800     0.0     1.0              0.1
 2:  2 20000  425 0.79    A       0   50000     300     800     0.0     1.0              0.1
 3:  3 30000  324 0.81    A       0   50000     300     800     0.0     1.0              0.1
 4:  4 40000  263 0.56    A       0   50000     300     800     0.0     1.0              0.1
 5:  5 50000  187 0.88    A       0   50000     300     800     0.0     1.0              0.1
 6:  1 10000  387 0.48    B   30000   80000     400     800     0.5     0.9              0.2
 7:  2 20000  425 0.79    B   30000   80000     400     800     0.5     0.9              0.2
 8:  3 30000  324 0.81    B   30000   80000     400     800     0.5     0.9              0.2
 9:  4 40000  263 0.56    B   30000   80000     400     800     0.5     0.9              0.2
10:  5 50000  187 0.88    B   30000   80000     400     800     0.5     0.9              0.2
11:  1 10000  387 0.48    C   10000   30000     300     800     0.2     0.8              0.3
12:  2 20000  425 0.79    C   10000   30000     300     800     0.2     0.8              0.3
13:  3 30000  324 0.81    C   10000   30000     300     800     0.2     0.8              0.3
14:  4 40000  263 0.56    C   10000   30000     300     800     0.2     0.8              0.3
15:  5 50000  187 0.88    C   10000   30000     300     800     0.2     0.8              0.3

And leaving off your which.min():

> workingdt[val1>=minval1 & val1<=maxval1 & val2>=minval2 &
+           val2<=maxval2 & val3>=minval3 & val3<=maxval3]
   id  val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1:  1 10000  387 0.48    A       0   50000     300     800     0.0     1.0              0.1
2:  2 20000  425 0.79    A       0   50000     300     800     0.0     1.0              0.1
3:  3 30000  324 0.81    A       0   50000     300     800     0.0     1.0              0.1
4:  1 10000  387 0.48    C   10000   30000     300     800     0.2     0.8              0.3
5:  2 20000  425 0.79    C   10000   30000     300     800     0.2     0.8              0.3

If you use the data.table group-by functionality, you can pick the min(correspondingval) for each id (I am leaving off the unmatched data for the moment):

> workingdt[val1>=minval1 & val1<=maxval1 & val2>=minval2 &
+           val2<=maxval2 & val3>=minval3 & val3<=maxval3][
+           , .SD[which.min(correspondingval)], by=id]
   id  val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1:  1 10000  387 0.48    A       0   50000     300     800       0       1              0.1
2:  2 20000  425 0.79    A       0   50000     300     800       0       1              0.1
3:  3 30000  324 0.81    A       0   50000     300     800       0       1              0.1

Or, the max(correspondingval) if you prefer:

> workingdt[val1>=minval1 & val1<=maxval1 & val2>=minval2 &
+           val2<=maxval2 & val3>=minval3 & val3<=maxval3][
+           , .SD[which.max(correspondingval)], by=id]
   id  val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1:  1 10000  387 0.48    C   10000   30000     300     800     0.2     0.8              0.3
2:  2 20000  425 0.79    C   10000   30000     300     800     0.2     0.8              0.3
3:  3 30000  324 0.81    A       0   50000     300     800     0.0     1.0              0.1

If all you want, as shown in your Desired Output, is the first row with the minimum correspondingval and everything else filled with NA, there are easier ways to do this. If you want to know where each id matches a range, as I have shown in my output, then a cleaner, more elegant solution looks different.
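For example, one such easier way, assuming a plain left join back onto mydata is acceptable (best is just a name I am introducing for this sketch), might look like this:

# Keep the lowest-correspondingval match per id, then left-join back onto
# mydata so ids without any match come through as NA.
best <- workingdt[val1 >= minval1 & val1 <= maxval1 &
                  val2 >= minval2 & val2 <= maxval2 &
                  val3 >= minval3 & val3 <= maxval3,
                  .SD[which.min(correspondingval)], by = id]

mydata.withmatches <- merge(mydata,
                            best[, !c("val1", "val2", "val3"), with = FALSE],
                            by = "id", all.x = TRUE)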

Let me know.

Daniel Wisehart

Nice question! Below you can find my quick fix, but it's also still a bit too clunky for my taste.

The generated mydata set:

   id  val1 val2      val3
1:  1 10000  377 0.7912443
2:  2 20000  378 0.7709792
3:  3 30000  484 0.7049517
4:  4 40000  513 0.5169590
5:  5 50000  474 0.7987448

A simple function to filter away the non-matching rows of the comparison dataset (multiple rows can match).

library(dplyr)

find_interval_func <- function(var.min.name, var.max.name, value, val.to.return){
  # start from the full comparison set and filter it down one range pair at a time
  compset <- data.frame(mycomparison)

  for(i in seq_along(var.min.name)){

    compset <-
      compset %>%
      filter_(paste0(var.min.name[[i]], "<=", value[[i]]),
              paste0(var.max.name[[i]], ">", value[[i]]))
  }

  # collapse the names of all matching ranges into a single string
  paste(compset[, val.to.return], collapse = "|")

}

And the result:

> mydata %>%
+   group_by(1:n()) %>%
+   mutate(matchedValue = find_interval_func(c("minval1", "minval2", "minval3"),
+                                        c("maxval1", "maxval2", "maxval3"),
+                                        c(val1, val2, val3), 
+                                        "Name"))
Source: local data table [5 x 6]

  id  val1 val2      val3 1:n() matchedValue
1  1 10000  377 0.7912443     1          A|C
2  2 20000  378 0.7709792     2          A|C
3  3 30000  484 0.7049517     3          A|B
4  4 40000  513 0.5169590     4          A|B
5  5 50000  474 0.7987448     5            B
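As a side note, filter_() has since been deprecated in dplyr. On newer versions the same idea can be written with filter() and the .data pronoun; a rough sketch (not tested against every dplyr release):

find_interval_func <- function(var.min.name, var.max.name, value, val.to.return){
  compset <- data.frame(mycomparison)

  for(i in seq_along(var.min.name)){
    # .data[[...]] looks a column up by its character name,
    # replacing the string-built filter_() expressions
    compset <- compset %>%
      filter(.data[[var.min.name[[i]]]] <= value[[i]],
             .data[[var.max.name[[i]]]] >  value[[i]])
  }

  paste(compset[, val.to.return], collapse = "|")
}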
Bart