I have two data frames of co-ordinates. Each data frame has two fixed co-ordinates and one co-ordinate that is a range (represented by two columns holding the start and end of the range). The actual data frames are very large, ~40,000 rows. Here is some dummy data:
vdata <- data.frame(distance = 1:12, x = c(1,1,1,1,1,1,2,2,2,2,2,2), z = c(1,1,1,2,2,2,1,1,1,2,2,2),
  ystart = c(0.5,3,3,3,3,1.5,3,3,3,1.5,1.5,0.5), yend = c(1.5,4,4,4,4,2.5,4,4,4,2.5,2.5,1.5))
hdata <- data.frame(distance = 1:12, x = c(1,1,1,1,1,1,2,2,2,2,2,2), y = c(1,1,1,2,2,2,1,1,1,2,2,2),
  zstart = c(0.5,3,1.5,3,3,3,3,3,1.5,1.5,1.5,3), zend = c(1.5,4,2.5,4,4,4,4,4,2.5,2.5,2.5,4))
> vdata
# distance x z ystart yend
#1 1 1 1 0.5 1.5
#2 2 1 1 3.0 4.0
#3 3 1 1 3.0 4.0
#4 4 1 2 3.0 4.0
#5 5 1 2 3.0 4.0
#6 6 1 2 1.5 2.5
#7 7 2 1 3.0 4.0
#8 8 2 1 3.0 4.0
#9 9 2 1 3.0 4.0
#10 10 2 2 1.5 2.5
#11 11 2 2 1.5 2.5
#12 12 2 2 0.5 1.5
> hdata
# distance x y zstart zend
#1 1 1 1 0.5 1.5
#2 2 1 1 3.0 4.0
#3 3 1 1 1.5 2.5
#4 4 1 2 3.0 4.0
#5 5 1 2 3.0 4.0
#6 6 1 2 3.0 4.0
#7 7 2 1 3.0 4.0
#8 8 2 1 3.0 4.0
#9 9 2 1 1.5 2.5
#10 10 2 2 1.5 2.5
#11 11 2 2 1.5 2.5
#12 12 2 2 3.0 4.0
I want to find pairs of rows, one from each data frame, where the co-ordinates overlap. For instance, a hit would be row 1 of vdata with row 1 of hdata, because both have x = 1, vdata's z co-ordinate falls within the z range of hdata, and hdata's y co-ordinate falls within the y range of vdata.
> vdata[1,]
distance x z ystart yend
1 1 1 1 0.5 1.5
> hdata[1,]
distance x y zstart zend
1 1 1 1 0.5 1.5
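To make the rule concrete, here is a minimal sketch of the check for a single pair of rows. The helper name is_hit is made up for illustration, and treating the range ends as inclusive is an assumption (the dummy data has no value sitting exactly on a boundary, so it makes no difference here).

```r
# Row 1 of each data frame, copied from the dummy data above
v <- data.frame(distance = 1, x = 1, z = 1, ystart = 0.5, yend = 1.5)
h <- data.frame(distance = 1, x = 1, y = 1, zstart = 0.5, zend = 1.5)

# Hypothetical helper: TRUE when one vdata row and one hdata row overlap.
# Inclusive range ends are an assumption, not something the data settles.
is_hit <- function(v, h) {
  v$x == h$x &&                            # same x
    v$z >= h$zstart && v$z <= h$zend &&    # vdata's z inside hdata's z range
    h$y >= v$ystart && h$y <= v$yend       # hdata's y inside vdata's y range
}

is_hit(v, h)
```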
The correct output for this dummy dataset should be this:
> results
vdistance hdistance x ystart yend zstart zend
1 1 1 1 0.5 1.5 0.5 1.5
2 12 9 2 0.5 1.5 1.5 2.5
3 10 10 2 1.5 2.5 1.5 2.5
4 11 10 2 1.5 2.5 1.5 2.5
5 10 11 2 1.5 2.5 1.5 2.5
6 11 11 2 1.5 2.5 1.5 2.5
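For reference, the rule can be written as a brute-force merge and filter, which should reproduce the table above on the dummy data. This is only a sketch to pin down the expected output: it builds every pair of rows that share an x value, so it may well not scale to the full data. Inclusive range ends and the .v / .h suffixes are my assumptions.

```r
# Dummy data as printed above: vdata carries z and the y range,
# hdata carries y and the z range.
vdata <- data.frame(distance = 1:12,
                    x = c(1,1,1,1,1,1,2,2,2,2,2,2),
                    z = c(1,1,1,2,2,2,1,1,1,2,2,2),
                    ystart = c(0.5,3,3,3,3,1.5,3,3,3,1.5,1.5,0.5),
                    yend   = c(1.5,4,4,4,4,2.5,4,4,4,2.5,2.5,1.5))
hdata <- data.frame(distance = 1:12,
                    x = c(1,1,1,1,1,1,2,2,2,2,2,2),
                    y = c(1,1,1,2,2,2,1,1,1,2,2,2),
                    zstart = c(0.5,3,1.5,3,3,3,3,3,1.5,1.5,1.5,3),
                    zend   = c(1.5,4,2.5,4,4,4,4,4,2.5,2.5,2.5,4))

# All vdata/hdata pairs with the same x; the shared "distance" column
# becomes distance.v and distance.h.
pairs <- merge(vdata, hdata, by = "x", suffixes = c(".v", ".h"))

# Keep pairs where z lies in the z range and y lies in the y range
# (inclusive ends assumed).
hits <- pairs[pairs$z >= pairs$zstart & pairs$z <= pairs$zend &
              pairs$y >= pairs$ystart & pairs$y <= pairs$yend, ]

results <- data.frame(vdistance = hits$distance.v,
                      hdistance = hits$distance.h,
                      x = hits$x,
                      ystart = hits$ystart, yend = hits$yend,
                      zstart = hits$zstart, zend = hits$zend)

# Same row order as the table above: by hdistance, then vdistance
results <- results[order(results$hdistance, results$vdistance), ]
rownames(results) <- NULL
results
```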
I wrote a slow and complicated set of nested for loops and if / else if statements to sort these out, but it takes far too long on my full dataset. I tried to make it faster in two ways: splitting the data frames by x and y and by x and z and then checking only the first x co-ordinate of each frame, and ordering by the ystart and zstart columns and stopping once the z or y went out of range. It's still too slow.
Any ideas on a better approach for this?