
I am given two very large data sets and I've been trying to build a function that finds certain coordinates in one set that satisfy a condition involving the other data set. My problem is that the function I wrote is very slow, and although I've been reading answers to similar questions, I haven't managed to make it work.
So if I am given:

> head(CTSS)
    V1     V2     V3
1 chr1 564563 564598 
2 chr1 564620 564649
3 chr1 565369 565404
4 chr1 565463 565541
5 chr1 565653 565697
6 chr1 565861 565922

and

> head(href)
   chr      region    start      end strand nu   gene_id transcript_id
1 chr1 start_codon 67000042 67000044      +  . NM_032291     NM_032291
2 chr1         CDS 67000042 67000051      +  0 NM_032291     NM_032291
3 chr1        exon 66999825 67000051      +  . NM_032291     NM_032291
4 chr1         CDS 67091530 67091593      +  2 NM_032291     NM_032291
5 chr1        exon 67091530 67091593      +  . NM_032291     NM_032291
6 chr1         CDS 67098753 67098777      +  1 NM_032291     NM_032291

For each value in the start column of the href data set, I want to find the first two values in the third column (V3) of the CTSS data set that are smaller than or equal to it, and keep them in a new data frame.
The loop I wrote:

y <- CTSS[order(-CTSS$V3), ]     # CTSS sorted by V3 in decreasing order
find_CTSS <- function(x, y) {
    n <- length(x$start)
    foo <- data.frame(matrix(0, n, 6))
    for (i in 1:n) {
        # rows of the sorted CTSS whose V3 is <= this start, largest first
        a <- which(y$V3 <= x$start[i])
        foo[i, ] <- c(x$start[i], x$end[i], y$V2[a[1]], y$V3[a[1]], y$V2[a[2]], y$V3[a[2]])
    }

    print(foo)
}
Nanami
  • I *think* this is one of the things `data.table` could make substantially faster. – Ari B. Friedman Aug 20 '11 at 12:35
  • By first do you just mean first in V3 in the current data frame order or do you mean the lowest two values, or do you mean the values most immediately below or equal to start?... or something else. I suppose I could go with Roman Lustrik's code as a base for definition? – John Aug 22 '11 at 12:13

3 Answers


You provide little data (but see here), so it's a bit hard to benchmark your solution. See if the solution below meets your needs.

#make some fake data
href <- data.frame(start = runif(10), stop = runif(10), other_col = sample(letters, 10))
CTSS <- data.frame(col1 = runif(100), col2 = runif(100))

# for each row in href (but extract only stop and start columns)
result <- apply(X = href[, c("start", "stop")], MARGIN = 1, FUN = function(x, ctss) {
            criterion <- x["start"] #make a criterion
            #see which values are smaller or equal to this criterion (and sort them)
            extracted <- sort(ctss[ctss$col2 <= criterion, "col2"])
            #extract last and one to last value
            get.values <- extracted[c(length(extracted) - 1, length(extracted))] 
            #put values in data frame
            out <- as.data.frame(matrix(get.values, ncol = 2)) 
            return(out)
        }, ctss = CTSS)

#pancake a list into a data.frame
result <- do.call("rbind", result) 
Roman Luštrik
  • I've tried it this way, but it still takes a very long time. The dimensions of my two data sets are dim(CTSS): 76263 rows and 632 columns, and dim(href): 791930 rows and 8 columns. Would it be better if I just created new data frames that contain only the columns I need from both of them? – Nanami Aug 20 '11 at 12:07
  • You can achieve a speedup by using the `partial` option in `sort`, since you are only interested in the first two values. Some care is required, since the number of elements satisfying the criterion might be less than 2, in which case `partial = c(1, 2)` will throw an error. It should be possible to handle this with a `failwith` or some other statement (a sketch of the idea follows these comments). Will post a solution if time permits – Ramnath Aug 20 '11 at 13:00
  • +1 This is the best solution of everything I've tried so far. – Andrie Aug 20 '11 at 14:42
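Following up on Ramnath's comment above, here is a rough sketch of what a guarded partial sort might look like. This is my own illustration, not code from the question or answers: the helper name top_two_leq is made up, and it assumes Roman Luštrik's fake column names (CTSS$col2, href$start).

#sketch of the partial-sort idea; top_two_leq is a hypothetical helper
top_two_leq <- function(values, criterion) {
    extracted <- values[values <= criterion]
    k <- length(extracted)
    if (k == 0) return(c(NA, NA))          #no qualifying values at all
    if (k == 1) return(c(NA, extracted))   #only one value, avoid partial = c(0, 1)
    #partial sorting guarantees positions k - 1 and k hold the correct order statistics
    extracted <- sort(extracted, partial = c(k - 1, k))
    extracted[c(k - 1, k)]
}

#for example, the two largest CTSS values not exceeding the first start
top_two_leq(CTSS$col2, href$start[1])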

I see that the main thing you want here is a speedup. Borrowing heavily from Roman Luštrik's code, I can't see any advantage to putting sort inside the apply; that would really slow things down. In fact, you want to move as much as possible out of the apply (loop). So the following should run much faster.

#all code using Roman Lustrik's made up data

CTSSs <- CTSS[order(CTSS$col2),] #presort CTSS
result <- lapply(X = href$start, FUN = function(x, ctss) {
    extracted <- ctss$col2[ctss$col2 <= x]
    get.values <- tail(extracted,2)
    out <- matrix(get.values, ncol = 2)
    return(out)}, ctss = CTSSs)
#pancake a list into a data.frame
result <- as.data.frame(do.call("rbind", result))

Or, I could follow the spirit of vectorization further and really get the functions as small as possible.

CTSSs <- CTSS[order(CTSS$col2),] #presort CTSS
extracted <- lapply(href$start, function(x, ctss) {
    ctss$col2[ctss$col2 <= x]}, ctss = CTSSs)
get.values <- lapply(extracted, tail, n = 2 )
result <- t( sapply(get.values, matrix, ncol = 2) )

#convert to a data.frame
result <- as.data.frame(result)

This may or may not be faster in your case, but should you need to add an intermediate step that can take advantage of the truly vectorized built-in functions, say if you want to do math on the values before putting them into a data frame, you can easily do that. Also, you'll note that I can now use a sapply/transpose at the matrix stage, which will be faster than lapply/rbind. That's often where the speedup of vectorizing your code comes from, not just from wrapping a big loop in apply. (As an aside, it makes it easier to check for errors in each step of your thinking... or maybe that's not an aside?)
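If you want to check that last claim on your own data, an informal system.time() comparison is enough. This is my own illustration, not part of the original answer; it assumes the get.values list from the code above is already in your session and that every element has length 2.

#time the lapply/rbind assembly against the sapply/transpose assembly
system.time(res1 <- do.call("rbind", lapply(get.values, matrix, ncol = 2)))
system.time(res2 <- t(sapply(get.values, matrix, ncol = 2)))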

REVISION:

On further reflection I realized this can be completely vectorized. The following code will generate what you want MUCH faster than any of the prior examples. The trick is to use the cut() and aggregate() commands.

href <- href[order(href$start),] #just sorted so that the 0 at the beginning makes sense and the labels then match
margin <- cut(CTSS$col2, breaks = c(0,href$start), labels = href$start, right = TRUE)
result <- aggregate(col2 ~ margin, data = CTSS, FUN = function(x) tail(x,2))

You can reformat the result as you wish to get what you want, but that should do the meat of it. You might want to change the margin column to numeric so that it matches href$start, and use code similar to the sapply in the middle example above to turn the list of pairs into two separate columns; a sketch of that follows. It was the if() statement in the loop or apply that was slowing you down before, and cut() eliminates that.
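For concreteness, here is a minimal sketch of that reformatting step. It is my own illustration, not part of the answer; it assumes every margin caught at least two CTSS values, so that aggregate() returned a two-column matrix in col2, and the output column names are made up.

#factor labels back to the numeric start values they were built from
result$margin <- as.numeric(as.character(result$margin))
#split the pairs returned by tail(x, 2) into two ordinary columns
final <- data.frame(start  = result$margin,
                    lower1 = result$col2[, 1],
                    lower2 = result$col2[, 2])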

John

I don't know how long I'll devote myself to this, so I'll build forward. I was an APL guy back when this kind of question received one-line answers in the APL Journal. Later I became a C++/STL guy and learned all the same stuff in a new dress code. Sometimes R makes me think that APL mated with PHP.

In this problem the dataframes are a distraction. This is a simple vector search, with some gluing back together.

For the performance-critical vector search, you want findInterval. The vector you search within (the second argument) must be sorted; the values you search for can be in any order, although for large lists you will want them sorted as well.

    V <- sort (runif(10*1000*1000))
    U <- sort (runif(10*1000*1000)) 
    W <- findInterval (U, V) 

This runs in three shakes of a lamb's tail. W is now a vector of indices into V: W[i] is the position of the largest element of V that is less than or equal to U[i] (0 when there is none), so you effectively have the pairs (i, W[i]) for i in 1:length(U).

    sum(u==sort(u)[sort.int (sort.int (u, index.return=TRUE)$ix, index.return=TRUE)$ix])

OK, there's a contribution from my APL brainstem. The answer is length(u) and demonstrates the inverse sort required for the "glue back together".

Mind blowing fact: only special cases of the sort function in R return the index vector. In APL, that was the only answer you got from grade up/grade down. But hey, it's not like they got it right the first time.

You'll have to adapt the result of findInterval to pick two elements on the less-than side of the match location, and you'll have to undo the two sorts to glue things back together (a sketch of that adaptation follows). I suspect your runtime will be dominated either by the two sorts (for very long lists) or by assembling the resulting dataframe (for smaller lists). On my system, sorting a numeric list of length 100*1000*1000 begins to bog.
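Here is a rough sketch of that adaptation in terms of the original question's columns. It is my own illustration, not Allan's code: it leaves href in its original order (findInterval only requires its second argument to be sorted), so no inverse sort is needed, and it pads with NA when fewer than two CTSS rows qualify.

    ctss_sorted <- CTSS[order(CTSS$V3), ]              #the search-within vector must be sorted
    idx <- findInterval(href$start, ctss_sorted$V3)    #index of the largest V3 <= each start

    #the closest V3 at or below each start, and the one just below that;
    #indexing with NA yields NA where fewer than one or two matches exist
    first_below  <- ctss_sorted$V3[ifelse(idx >= 1, idx,     NA)]
    second_below <- ctss_sorted$V3[ifelse(idx >= 2, idx - 1, NA)]

    result <- data.frame(start = href$start, end = href$end,
                         V3_first = first_below, V3_second = second_below)

The same indexing picks up V2 (or any other CTSS column) if you need it in the output.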

The run time for findInterval sandwiched in between will be a thin slice of lettuce, which reminds me why I wasn't planning to loiter.

Allan Stokes