
I have two data frames of co-ordinates. Each has two fixed co-ordinates and one co-ordinate given as a range (represented by two columns holding the start and end of the range). The actual data frames are very large, ~40,000 rows each. Here is some dummy data:

    hdata <- data.frame(distance = 1:12,
                        x = c(1,1,1,1,1,1,2,2,2,2,2,2),
                        z = c(1,1,1,2,2,2,1,1,1,2,2,2),
                        ystart = c(0.5,3,3,3,3,1.5,3,3,3,1.5,1.5,0.5),
                        yend   = c(1.5,4,4,4,4,2.5,4,4,4,2.5,2.5,1.5))
    vdata <- data.frame(distance = 1:12,
                        x = c(1,1,1,1,1,1,2,2,2,2,2,2),
                        y = c(1,1,1,2,2,2,1,1,1,2,2,2),
                        zstart = c(0.5,3,1.5,3,3,3,3,3,1.5,1.5,1.5,3),
                        zend   = c(1.5,4,2.5,4,4,4,4,4,2.5,2.5,2.5,4))


> hdata
#   distance x z ystart yend
#1         1 1 1    0.5  1.5
#2         2 1 1    3.0  4.0
#3         3 1 1    3.0  4.0
#4         4 1 2    3.0  4.0
#5         5 1 2    3.0  4.0
#6         6 1 2    1.5  2.5
#7         7 2 1    3.0  4.0
#8         8 2 1    3.0  4.0
#9         9 2 1    3.0  4.0
#10       10 2 2    1.5  2.5
#11       11 2 2    1.5  2.5
#12       12 2 2    0.5  1.5

> vdata
#   distance x y zstart zend
#1         1 1 1    0.5  1.5
#2         2 1 1    3.0  4.0
#3         3 1 1    1.5  2.5
#4         4 1 2    3.0  4.0
#5         5 1 2    3.0  4.0
#6         6 1 2    3.0  4.0
#7         7 2 1    3.0  4.0
#8         8 2 1    3.0  4.0
#9         9 2 1    1.5  2.5
#10       10 2 2    1.5  2.5
#11       11 2 2    1.5  2.5
#12       12 2 2    3.0  4.0

I want to find rows where the co-ordinates overlap. So for instance, a hit would be row 1 of vdata with row 1 of hdata, because both have x = 1, hdata's z co-ordinate falls within the z range of vdata, and vdata's y co-ordinate falls within the y range of hdata.

> hdata[1,]
  distance x z ystart yend
1        1 1 1    0.5  1.5
> vdata[1,]
  distance x y zstart zend
1        1 1 1    0.5  1.5

The correct output for this dummy dataset should be this:

> results
   hdistance vdistance x ystart yend zstart zend
1          1         1 1    0.5  1.5    0.5  1.5
2         12         9 2    0.5  1.5    1.5  2.5
3         10        10 2    1.5  2.5    1.5  2.5
4         11        10 2    1.5  2.5    1.5  2.5
5         10        11 2    1.5  2.5    1.5  2.5
6         11        11 2    1.5  2.5    1.5  2.5

I wrote a very slow, complicated set of nested for loops with if / else if statements to sort these out, and it takes far too long on my massive dataset. I tried to speed it up by splitting the data frames by x and y (and by x and z) and then comparing only rows with matching x, and by ordering on the ystart and zstart columns so I could stop once y or z went out of range, but it is still too slow.

Any ideas on a better approach for this?

OxL
  • A non-equi join in the development version (1.9.7) of `data.table` will do this. See item 3 in ['new features'](https://github.com/Rdatatable/data.table/blob/master/NEWS.md) – SymbolixAU Oct 05 '16 at 01:56
  • Also, [this question](http://stackoverflow.com/q/35565149/5977215) has some examples of how to do this, and [Arun's answer](http://stackoverflow.com/a/38663460/5977215) shows non-equi joins in action (he's a co-developer of data.table) – SymbolixAU Oct 05 '16 at 01:58
  • Thanks for the useful links Symbolix! – OxL Oct 06 '16 at 19:38
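
The non-equi join suggested in the comments can be sketched as below (this is my own sketch, assuming `data.table` 1.9.8 or later, where non-equi joins were released; the column names built in `j` are my own choices). All four conditions go into a single `on = ...` clause, so the within-x cartesian product is never materialised:

```r
library(data.table)

# dummy data from the question
hdata <- data.frame(distance = 1:12, x = c(1,1,1,1,1,1,2,2,2,2,2,2),
                    z = c(1,1,1,2,2,2,1,1,1,2,2,2),
                    ystart = c(0.5,3,3,3,3,1.5,3,3,3,1.5,1.5,0.5),
                    yend   = c(1.5,4,4,4,4,2.5,4,4,4,2.5,2.5,1.5))
vdata <- data.frame(distance = 1:12, x = c(1,1,1,1,1,1,2,2,2,2,2,2),
                    y = c(1,1,1,2,2,2,1,1,1,2,2,2),
                    zstart = c(0.5,3,1.5,3,3,3,3,3,1.5,1.5,1.5,3),
                    zend   = c(1.5,4,2.5,4,4,4,4,4,2.5,2.5,2.5,4))

ht <- as.data.table(hdata)
vt <- as.data.table(vdata)

# equal x, vdata's y inside hdata's [ystart, yend],
# and hdata's z inside vdata's [zstart, zend]
res <- ht[vt,
          on = .(x, ystart <= y, yend >= y, z >= zstart, z <= zend),
          nomatch = NULL,           # nomatch = 0L on older versions
          .(hdistance = x.distance, vdistance = i.distance, x,
            ystart = x.ystart, yend = x.yend,
            zstart = i.zstart, zend = i.zend)]
```

On the dummy data this returns the same six matching pairs as the expected output above.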

1 Answer


Consider a merge with a conditional filter:

mdf <- merge(hdata, vdata, by = "x")   # cartesian product within each x

finaldf <- mdf[(mdf$z >= mdf$zstart & mdf$z <= mdf$zend) &
               (mdf$y >= mdf$ystart & mdf$y <= mdf$yend), ]

rownames(finaldf) <- seq(nrow(finaldf))
# merge() suffixes the duplicated 'distance' columns as distance.x / distance.y;
# rename them to hdistance / vdistance, then reorder
colnames(finaldf) <- c("x", "hdistance", "z", "ystart", "yend",
                       "vdistance", "y", "zstart", "zend")
finaldf <- finaldf[c("hdistance", "vdistance", "x",
                     "ystart", "yend", "zstart", "zend")]

finaldf
#   hdistance vdistance x ystart yend zstart zend
# 1         1         1 1    0.5  1.5    0.5  1.5
# 2        10        10 2    1.5  2.5    1.5  2.5
# 3        10        11 2    1.5  2.5    1.5  2.5
# 4        11        10 2    1.5  2.5    1.5  2.5
# 5        11        11 2    1.5  2.5    1.5  2.5
# 6        12         9 2    0.5  1.5    1.5  2.5
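
For the ~40,000-row case, the single `merge()` can be slow because it materialises every within-x pairing before the filter runs. One way to cap the size of the intermediate product, with the same logic just chunked by x (a sketch, not benchmarked):

```r
# dummy data as built in the question
hdata <- data.frame(distance = 1:12, x = c(1,1,1,1,1,1,2,2,2,2,2,2),
                    z = c(1,1,1,2,2,2,1,1,1,2,2,2),
                    ystart = c(0.5,3,3,3,3,1.5,3,3,3,1.5,1.5,0.5),
                    yend   = c(1.5,4,4,4,4,2.5,4,4,4,2.5,2.5,1.5))
vdata <- data.frame(distance = 1:12, x = c(1,1,1,1,1,1,2,2,2,2,2,2),
                    y = c(1,1,1,2,2,2,1,1,1,2,2,2),
                    zstart = c(0.5,3,1.5,3,3,3,3,3,1.5,1.5,1.5,3),
                    zend   = c(1.5,4,2.5,4,4,4,4,4,2.5,2.5,2.5,4))

# merge + filter one x-group at a time, so only one group's
# cartesian product exists in memory at once
parts <- lapply(split(vdata, vdata$x), function(v) {
  h <- hdata[hdata$x == v$x[1], ]
  m <- merge(h, v, by = "x")          # distance -> distance.x / distance.y
  m[(m$z >= m$zstart & m$z <= m$zend) &
    (m$y >= m$ystart & m$y <= m$yend), ]
})
chunked <- do.call(rbind, parts)      # six rows on the dummy data
```

The per-group results stack back together with `rbind`, giving the same rows as `finaldf` above.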
Parfait
  • Thank you very much! It worked well for my test data set. When I tried it on the 40,000 row dataframes it got stuck on the merge step which was very slow... however I realised that I actually have concatenated my data too early and it works fine when I run it on the smaller subsets (~6,000 rows). (The merge step takes ~4 seconds and the rest takes ~1 second) – OxL Oct 05 '16 at 21:00
  • Fantastic! Always love to hear how the solution works in real-life practice. Glad I could help. – Parfait Oct 05 '16 at 21:33