Generate a dataframe of values that fall within a large series of ranges

Question

I am taking a series of medians and checking to see if they are in-between multiple ranges, and then storing the medians that do match as well as a label associated with them. This code works, but the files are far too large for this iterative approach. Is there a faster way to make these comparisons and record the matches in a dataframe?

The structure of the tfFile is:

    V1        V2        V3   V4    Center_Point
1  chr3 158289024 158289224 CMYC    158289124
2  chr1 242601432 242601632 KLF4    242601532
3 chr11  85912879  85913079 CMYC     85912979
4 chr14  86369800  86370000 SOX2     86369900
5  chr6   8397251   8397451 SOX2      8397351
6  chr3 123709437 123709637 SOX2    123709537

The structure of the ranges is:

   V1       V2       V3        
1 chr1 11323785 11617177 
2 chr1 12645605 13926923 
3 chr1 14750216 15119039
4 chr1 18102157 19080189 
5 chr1 29491029 30934636 
6 chr1 33716472 35395979

Here's a look at the code:

tfFile = read.table("medianfile.txt", sep= "", stringsAsFactors=FALSE)
ranges = read.table("ranges.txt", sep= "", stringsAsFactors=FALSE)
centerdf <- data.frame('Center_Point' = numeric(0))
Center_Point<-apply(tfFile[c('V2', 'V3')], 1, median, na.rm=TRUE)
data<-cbind(tfFile,Center_Point)
tempdf = data.frame( 'Center_Point' = numeric(0), "TF" = character(0),stringsAsFactors = FALSE)
generatedata<-function(data, lamina){
matchesdf <- data.frame( 'Center_Point' = numeric(0), "TF" = character(0),    stringsAsFactors = FALSE)

#Making the comparisons
for(j in 1:nrow(data)){
  for(k in 1:nrow(ranges)){
  #if the value falls within the LADs
    if(data$Center_Point[j]< ranges$V3[k] && data$Center_Point[j]>ranges$V2[k]){
      tempdf<-data.frame(Center_Point = data$Center_Point[j], TF = data$V4[j])
      matchesdf <- rbind(matchesdf, tempdf)
   } 
 }
}
return(matchesdf)
}
a<-generatedata(data, ranges)

it would be great if you gave us the structure of `tfile` and `ranges` rather than just telling us they're read from txt files (not relevant). Please see [how to write a great R reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — C8H10N4O2, Mar 01 '16 at 03:32
Can you explain precisely what information you want to know? It seems to me that either (a) you want to know, for each range, whether *each* median falls in that range, (b) only when the "chr" column matches, whether *each* median falls in that range, or (c) something else, e.g. for only row-by-row whether the given row matches the given range in the other table, etc...? — Philip, Mar 01 '16 at 03:49
Thank you for taking the time to answer this. I am looking for (a) for each range, I need to see if each median falls in that range, and record that median as well as its label which is the columns including data like KLF4 and CMYC. — Satchmo, Mar 01 '16 at 04:02

Philip · Accepted Answer · 2016-03-01T21:44:00.813

Please see my comment--I'm not sure exactly what you're trying to do, but it seems to have the flavor of a data.table join. I've copied your tables as data.tables so that:

> d1
     chr       low      high sthg       mid
1:  chr1 242601432 242601632 KLF4 242601532
2: chr11  85912879  85913079 CMYC  85912979
3: chr14  86369800  86370000 SOX2  86369900
4:  chr3 158289024 158289224 CMYC 158289124
5:  chr3 123709437 123709637 SOX2 123709537
6:  chr6   8397251   8397451 SOX2   8397351
> d2
    chr range.low range.high
1: chr1  11323785   11617177
2: chr1  12645605   13926923
3: chr1  14750216   15119039
4: chr1  18102157   19080189
5: chr1  29491029   30934636
6: chr1  33716472   35395979

And I've also done

setkey(d1,chr)
setkey(d2,chr)

Now, I can join these on the chr column, so when chr matches, you'll see each range:

> d2[d1]
      chr range.low range.high       low      high sthg       mid
 1:  chr1  11323785   11617177 242601432 242601632 KLF4 242601532
 2:  chr1  12645605   13926923 242601432 242601632 KLF4 242601532
 3:  chr1  14750216   15119039 242601432 242601632 KLF4 242601532
 4:  chr1  18102157   19080189 242601432 242601632 KLF4 242601532
 5:  chr1  29491029   30934636 242601432 242601632 KLF4 242601532
 6:  chr1  33716472   35395979 242601432 242601632 KLF4 242601532
 7: chr11        NA         NA  85912879  85913079 CMYC  85912979
 8: chr14        NA         NA  86369800  86370000 SOX2  86369900
 9:  chr3        NA         NA 158289024 158289224 CMYC 158289124
10:  chr3        NA         NA 123709437 123709637 SOX2 123709537
11:  chr6        NA         NA   8397251   8397451 SOX2   8397351

Now you can use a simple data.table operation to make a single pass through and identify where the median point falls in the range:

d <- d2[d1]
d[!is.na(range.low+range.high),
  falls.in.range:=(range.low <= mid & mid <= range.high)]
d

     chr range.low range.high       low      high sthg       mid falls.in.range
 1:  chr1  11323785   11617177 242601432 242601632 KLF4 242601532          FALSE
 2:  chr1  12645605   13926923 242601432 242601632 KLF4 242601532          FALSE
 3:  chr1  14750216   15119039 242601432 242601632 KLF4 242601532          FALSE
 4:  chr1  18102157   19080189 242601432 242601632 KLF4 242601532          FALSE
 5:  chr1  29491029   30934636 242601432 242601632 KLF4 242601532          FALSE
 6:  chr1  33716472   35395979 242601432 242601632 KLF4 242601532          FALSE
 7: chr11        NA         NA  85912879  85913079 CMYC  85912979             NA
 8: chr14        NA         NA  86369800  86370000 SOX2  86369900             NA
 9:  chr3        NA         NA 158289024 158289224 CMYC 158289124             NA
10:  chr3        NA         NA 123709437 123709637 SOX2 123709537             NA
11:  chr6        NA         NA   8397251   8397451 SOX2   8397351             NA

Not a beautiful example since none of the chr1 cases seem to meet the condition, but hopefully this gets the point across.

The key thing to note is that data.table joins are insanely fast, so if you choose your join columns correctly you should be able to take advantage of a fast join even on a large table and then make a single pass through this large table. You may need to consider a cross join depending on the particular problem. (See also: ?CJ and possibly allow.cartesian in ?data.table.)

Edit if you really mean you want to know for every range whether each midpoint falls in the range, then yes, you are in cross-join territory. Note this means that you essentially consider the "chr1"-style and "KLF4"-style columns to be extraneous to the question. In this case I might do something like this:

d1[,observation.ID:=.I]
setkey(d1,observation.ID)
d2[,range.ID:=.I]
setkey(d2,range.ID)
d <- CJ(observation.ID=d1[,observation.ID],range.ID=d2[,range.ID])
setkey(d,observation.ID)
d[d1,mid:=i.mid]
setkey(d,range.ID)
d[d2,c("range.low","range.high"):=.(i.range.low,i.range.high)]
d[,falls.in.range:=range.low <= mid & mid <= range.high]

> d
    observation.ID range.ID       mid range.low range.high falls.in.range
 1:              1        1 242601532  11323785   11617177          FALSE
 2:              2        1  85912979  11323785   11617177          FALSE
 3:              3        1  86369900  11323785   11617177          FALSE
 4:              4        1 158289124  11323785   11617177          FALSE
 5:              5        1 123709537  11323785   11617177          FALSE
 6:              6        1   8397351  11323785   11617177          FALSE
 7:              1        2 242601532  12645605   13926923          FALSE
 8:              2        2  85912979  12645605   13926923          FALSE
 9:              3        2  86369900  12645605   13926923          FALSE
10:              4        2 158289124  12645605   13926923          FALSE
11:              5        2 123709537  12645605   13926923          FALSE
12:              6        2   8397351  12645605   13926923          FALSE
13:              1        3 242601532  14750216   15119039          FALSE
14:              2        3  85912979  14750216   15119039          FALSE
15:              3        3  86369900  14750216   15119039          FALSE
16:              4        3 158289124  14750216   15119039          FALSE
17:              5        3 123709537  14750216   15119039          FALSE
18:              6        3   8397351  14750216   15119039          FALSE
19:              1        4 242601532  18102157   19080189          FALSE
20:              2        4  85912979  18102157   19080189          FALSE
21:              3        4  86369900  18102157   19080189          FALSE
22:              4        4 158289124  18102157   19080189          FALSE
23:              5        4 123709537  18102157   19080189          FALSE
24:              6        4   8397351  18102157   19080189          FALSE
25:              1        5 242601532  29491029   30934636          FALSE
26:              2        5  85912979  29491029   30934636          FALSE
27:              3        5  86369900  29491029   30934636          FALSE
28:              4        5 158289124  29491029   30934636          FALSE
29:              5        5 123709537  29491029   30934636          FALSE
30:              6        5   8397351  29491029   30934636          FALSE
31:              1        6 242601532  33716472   35395979          FALSE
32:              2        6  85912979  33716472   35395979          FALSE
33:              3        6  86369900  33716472   35395979          FALSE
34:              4        6 158289124  33716472   35395979          FALSE
35:              5        6 123709537  33716472   35395979          FALSE
36:              6        6   8397351  33716472   35395979          FALSE

(You can join the other detail columns on after the fact, e.g. setkey(d,observation.ID);setkey(d1,observation.ID);d[d1,sthg:=i.sthg] to get the "KLF4" column as I've named it.) But note that this probably isn't going to save a ton of time; if you're doing a full check of all midpoints against all ranges, then the speed-up is pretty much only in the better-vectorized data.table expression versus your nested for loops. So I'm not sure if this will be much better for your large table. Maybe try it and report back?

Update re typo: see the example below for a comparison of && (incorrect in this case) and & (correct in this case). &&, as you point out, only evaluates the first element of vectors, while & compares across vectors and returns a vector. So the output of && is recycled, yielding incorrect results when you mean to compare row-by-row:

> d1[,using.double.and:=low < mid && mid==242601532]
> d1[,using.single.and:=low < mid & mid==242601532]
> d1
     chr       low      high sthg       mid observation.ID using.double.and using.single.and
1:  chr1 242601432 242601632 KLF4 242601532              1             TRUE             TRUE
2: chr11  85912879  85913079 CMYC  85912979              2             TRUE            FALSE
3: chr14  86369800  86370000 SOX2  86369900              3             TRUE            FALSE
4:  chr3 158289024 158289224 CMYC 158289124              4             TRUE            FALSE
5:  chr3 123709437 123709637 SOX2 123709537              5             TRUE            FALSE
6:  chr6   8397251   8397451 SOX2   8397351              6             TRUE            FALSE

After a bit of tinkering with what you suggested, it worked! I am concerned that absolutely none of the mid's were in any of the ranges though. — Satchmo, Mar 01 '16 at 06:26
While running your for loop did you get any mids in any ranges? Glancing at the example data it does not look as though any are, but if lives depend on the answer you might want to check more carefully. — Philip, Mar 01 '16 at 15:15
After some testing, it appears that only the first row is where the mid is being compared to the range. Is there a way to apply this to every row without using for loops? — Satchmo, Mar 01 '16 at 20:22
Oh, that's a typo in my code: use `&` instead of `&&`. Will update. Good catch! — Philip, Mar 01 '16 at 21:34
Philip, great answer (already upvoted before). Just wanted to bring to attention the new non-equi join feature in data.table.. Cheers. — Arun, Jul 29 '16 at 16:26

score 2 · Answer 2 · answered Jul 29 '16 at 16:21

Using the new non-equi joins feature in the current development version of data.table, this is straightforward:

require(data.table) # v1.9.7+
d2[d1, .(mid, sthg), on=.(chr, range.low < mid, range.high > mid), nomatch=0L]

That's it. In this case, there's no match. Therefore an empty data.table is returned.

See the installation instructions for devel version here.

PS: I've used Philip's dataset, but without the setkey() part (as it is not necessary when using the on argument).

Generate a dataframe of values that fall within a large series of ranges

2 Answers2