Please see my comment--I'm not sure exactly what you're trying to do, but it seems to have the flavor of a data.table
join. I've copied your tables as data.table
s so that:
> d1
chr low high sthg mid
1: chr1 242601432 242601632 KLF4 242601532
2: chr11 85912879 85913079 CMYC 85912979
3: chr14 86369800 86370000 SOX2 86369900
4: chr3 158289024 158289224 CMYC 158289124
5: chr3 123709437 123709637 SOX2 123709537
6: chr6 8397251 8397451 SOX2 8397351
> d2
chr range.low range.high
1: chr1 11323785 11617177
2: chr1 12645605 13926923
3: chr1 14750216 15119039
4: chr1 18102157 19080189
5: chr1 29491029 30934636
6: chr1 33716472 35395979
And I've also done
setkey(d1,chr)
setkey(d2,chr)
Now, I can join these on the chr
column, so when chr
matches, you'll see each range:
> d2[d1]
chr range.low range.high low high sthg mid
1: chr1 11323785 11617177 242601432 242601632 KLF4 242601532
2: chr1 12645605 13926923 242601432 242601632 KLF4 242601532
3: chr1 14750216 15119039 242601432 242601632 KLF4 242601532
4: chr1 18102157 19080189 242601432 242601632 KLF4 242601532
5: chr1 29491029 30934636 242601432 242601632 KLF4 242601532
6: chr1 33716472 35395979 242601432 242601632 KLF4 242601532
7: chr11 NA NA 85912879 85913079 CMYC 85912979
8: chr14 NA NA 86369800 86370000 SOX2 86369900
9: chr3 NA NA 158289024 158289224 CMYC 158289124
10: chr3 NA NA 123709437 123709637 SOX2 123709537
11: chr6 NA NA 8397251 8397451 SOX2 8397351
Now you can use a simple data.table
operation to make a single pass through and identify where the median point falls in the range:
d <- d2[d1]
d[!is.na(range.low+range.high),
falls.in.range:=(range.low <= mid & mid <= range.high)]
d
chr range.low range.high low high sthg mid falls.in.range
1: chr1 11323785 11617177 242601432 242601632 KLF4 242601532 FALSE
2: chr1 12645605 13926923 242601432 242601632 KLF4 242601532 FALSE
3: chr1 14750216 15119039 242601432 242601632 KLF4 242601532 FALSE
4: chr1 18102157 19080189 242601432 242601632 KLF4 242601532 FALSE
5: chr1 29491029 30934636 242601432 242601632 KLF4 242601532 FALSE
6: chr1 33716472 35395979 242601432 242601632 KLF4 242601532 FALSE
7: chr11 NA NA 85912879 85913079 CMYC 85912979 NA
8: chr14 NA NA 86369800 86370000 SOX2 86369900 NA
9: chr3 NA NA 158289024 158289224 CMYC 158289124 NA
10: chr3 NA NA 123709437 123709637 SOX2 123709537 NA
11: chr6 NA NA 8397251 8397451 SOX2 8397351 NA
Not a beautiful example since none of the chr1
cases seem to meet the condition, but hopefully this gets the point across.
The key thing to note is that data.table
joins are insanely fast, so if you choose your join columns correctly you should be able to take advantage of a fast join even on a large table and then make a single pass through this large table. You may need to consider a cross join depending on the particular problem. (See also: ?CJ
and possibly allow.cartesian
in ?data.table
.)
Edit if you really mean you want to know for every range whether each midpoint falls in the range, then yes, you are in cross-join territory. Note this means that you essentially consider the "chr1"-style and "KLF4"-style columns to be extraneous to the question. In this case I might do something like this:
d1[,observation.ID:=.I]
setkey(d1,observation.ID)
d2[,range.ID:=.I]
setkey(d2,range.ID)
d <- CJ(observation.ID=d1[,observation.ID],range.ID=d2[,range.ID])
setkey(d,observation.ID)
d[d1,mid:=i.mid]
setkey(d,range.ID)
d[d2,c("range.low","range.high"):=.(i.range.low,i.range.high)]
d[,falls.in.range:=range.low <= mid & mid <= range.high]
> d
observation.ID range.ID mid range.low range.high falls.in.range
1: 1 1 242601532 11323785 11617177 FALSE
2: 2 1 85912979 11323785 11617177 FALSE
3: 3 1 86369900 11323785 11617177 FALSE
4: 4 1 158289124 11323785 11617177 FALSE
5: 5 1 123709537 11323785 11617177 FALSE
6: 6 1 8397351 11323785 11617177 FALSE
7: 1 2 242601532 12645605 13926923 FALSE
8: 2 2 85912979 12645605 13926923 FALSE
9: 3 2 86369900 12645605 13926923 FALSE
10: 4 2 158289124 12645605 13926923 FALSE
11: 5 2 123709537 12645605 13926923 FALSE
12: 6 2 8397351 12645605 13926923 FALSE
13: 1 3 242601532 14750216 15119039 FALSE
14: 2 3 85912979 14750216 15119039 FALSE
15: 3 3 86369900 14750216 15119039 FALSE
16: 4 3 158289124 14750216 15119039 FALSE
17: 5 3 123709537 14750216 15119039 FALSE
18: 6 3 8397351 14750216 15119039 FALSE
19: 1 4 242601532 18102157 19080189 FALSE
20: 2 4 85912979 18102157 19080189 FALSE
21: 3 4 86369900 18102157 19080189 FALSE
22: 4 4 158289124 18102157 19080189 FALSE
23: 5 4 123709537 18102157 19080189 FALSE
24: 6 4 8397351 18102157 19080189 FALSE
25: 1 5 242601532 29491029 30934636 FALSE
26: 2 5 85912979 29491029 30934636 FALSE
27: 3 5 86369900 29491029 30934636 FALSE
28: 4 5 158289124 29491029 30934636 FALSE
29: 5 5 123709537 29491029 30934636 FALSE
30: 6 5 8397351 29491029 30934636 FALSE
31: 1 6 242601532 33716472 35395979 FALSE
32: 2 6 85912979 33716472 35395979 FALSE
33: 3 6 86369900 33716472 35395979 FALSE
34: 4 6 158289124 33716472 35395979 FALSE
35: 5 6 123709537 33716472 35395979 FALSE
36: 6 6 8397351 33716472 35395979 FALSE
(You can join the other detail columns on after the fact, e.g. setkey(d,observation.ID);setkey(d1,observation.ID);d[d1,sthg:=i.sthg]
to get the "KLF4" column as I've named it.) But note that this probably isn't going to save a ton of time; if you're doing a full check of all midpoints against all ranges, then the speed-up is pretty much only in the better-vectorized data.table
expression versus your nested for loops. So I'm not sure if this will be much better for your large table. Maybe try it and report back?
Update re typo: see the example below for a comparison of &&
(incorrect in this case) and &
(correct in this case). &&
, as you point out, only evaluates the first element of vectors, while &
compares across vectors and returns a vector. So the output of &&
is recycled, yielding incorrect results when you mean to compare row-by-row:
> d1[,using.double.and:=low < mid && mid==242601532]
> d1[,using.single.and:=low < mid & mid==242601532]
> d1
chr low high sthg mid observation.ID using.double.and using.single.and
1: chr1 242601432 242601632 KLF4 242601532 1 TRUE TRUE
2: chr11 85912879 85913079 CMYC 85912979 2 TRUE FALSE
3: chr14 86369800 86370000 SOX2 86369900 3 TRUE FALSE
4: chr3 158289024 158289224 CMYC 158289124 4 TRUE FALSE
5: chr3 123709437 123709637 SOX2 123709537 5 TRUE FALSE
6: chr6 8397251 8397451 SOX2 8397351 6 TRUE FALSE