Make Dataframe loop code run faster

Question

DATA AND REQUIREMENTS

The first table (myMatrix1) is from an old geological survey that used different region boundaries (begin and finish) columns to the newer survey. What I wish to do is to match the begin and finish boundaries and then create two tables one for the new data on sedimentation and one for the new data on bore width characterised as a boolean.

myMatrix1 <- read.table("/path/to/file")
myMatrix2 <- read.table("/path/to/file")

> head(myMatrix1)  # this is the old data

    sampleIDs begin finish   
1    19990224 4     5 
2    20000224 5     6 
3    20010203 6     8 
4    20019024 29    30 
5    20020201 51    52 

> head(myMatrix2)   # this is the new data

     begin finish  sedimentation    boreWidth
1    0     10       1.002455        0.014354
2    11    367      2.094351        0.056431
3    368   920      0.450275        0.154105
4    921   1414     2.250820        1.004353
5    1415  5278     0.114109        NA`

Desired output:

> head(myMatrix6)

    sampleIDs begin finish  sedimentation #myMatrix4
1    19990224 4     5       1.002455
2    20000224 5     6       1.002455
3    20010203 6     8       2.094351
4    20019024 29    30      2.094351
5    20020201 51    52      2.094351

> head(myMatrix7)

    sampleIDs begin finish  boreWidthThresh #myMatrix5
1    19990224 4     5       FALSE
2    20000224 5     6       FALSE
3    20010203 6     8       FALSE
4    20019024 29    30      FALSE
5    20020201 51    52      FALSE`

CODE

The following code has taken me several hours to run on my dataset (about 5 million data points). Is there any way to change the code to make it run any faster?

# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]

# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]

for (i in 1:nrow(myMatrix2))
{       
    # create matrix that has the value of myMatrix1$begin being 
    # situated between the values of myMatrix2begin[i] and myMatrix2finish[i]
    myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin <      myMatrix2$finish[i])),]

    myMatrix4 <- rep(myMatrix2$sedimentation, nrow(myMatrix3))

    if (is.na(myMatrix2$boreWidth[i])) {
        myMatrix5 <- rep(NA, nrow(myMatrix3))
    }
    else if (myMatrix2$boreWidth[i] == 0) {
    myMatrix5 <- rep(TRUE, nrow(myMatrix3))
    }
    else if (myMatrix2$boreWidth[i] > 0) {
    myMatrix5 <- rep(FALSE, nrow(myMatrix3))
    }

    myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
    myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}

EDIT:

> dput(head(myMatrix2)

structure(list(V1 = structure(c(6L, 1L, 2L, 4L, 5L, 3L), .Label = c("0", 
"11", "1415", "368", "921", "begin"), class = "factor"), V2 = structure(c(6L, 
1L, 3L, 5L, 2L, 4L), .Label = c("10", "1414", "367", "5278", 
"920", "finish"), class = "factor"), V3 = structure(c(6L, 3L, 
4L, 2L, 5L, 1L), .Label = c("0.114109", "0.450275", "1.002455", 
"2.094351", "2.250820", "sedimentation"), class = "factor"), 
    V4 = structure(c(5L, 1L, 2L, 3L, 4L, 6L), .Label = c("0.014354", 
    "0.056431", "0.154105", "1.004353", "boreWidth", "NA"), class = "factor")), .Names = c("V1", 
"V2", "V3", "V4"), row.names = c(NA, 6L), class = "data.frame")

> dput(head(myMatrix1)

structure(list(V1 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("19990224", 
"20000224", "20010203", "20019024", "20020201", "sampleIDs"), class = "factor"), 
    V2 = structure(c(6L, 2L, 3L, 5L, 1L, 4L), .Label = c("29", 
    "4", "5", "51", "6", "begin"), class = "factor"), V3 = structure(c(6L, 
    2L, 4L, 5L, 1L, 3L), .Label = c("30", "5", "52", "6", "8", 
    "finish"), class = "factor")), .Names = c("V1", "V2", "V3"
), row.names = c(NA, 6L), class = "data.frame")

Ugh, `rbind` AND `cbind`. You're asking for trouble. Also, your code doesn't run - missing objects (`myMatrix2`). Check out http://www.burns-stat.com/pages/Tutor/R_inferno.pdf — Roman Luštrik, Nov 27 '11 at 12:00
Hi Roman, I haven't put the code in but `myMatrix2` and `myMatrix1` have been initialized by reading them in from a table. Is that what you meant by missing objects (`myMatrix2`)? What alternatives are there to `rbind` and `cbind` for the code above? Thanks — Kaleb, Nov 27 '11 at 12:15

score 4 · Accepted Answer · edited May 23 '17 at 12:27

4

First look at these general suggestions on speeding up code: https://stackoverflow.com/a/8474941/636656

The first thing that jumps out at me is that I'd create only one results matrix. That way you're not duplicating the sampleIDs begin finish columns, and you can avoid any overhead that comes with running the matching algorithm twice.

Doing that, you can avoid selecting more than once (although it's trivial in terms of speed as long as you store your selection vector rather than re-calculate).

Here's a solution using apply:

myMatrix1 <- data.frame(sampleIDs=c(19990224,20000224),begin=c(4,5),finish=c(5,6))
myMatrix2 <- data.frame(begin=c(0,11),finish=c(10,367),sed=c(1.002,2.01),boreWidth=c(.014,.056))

glommer <- function(x,myMatrix2) {
  x[4:5] <- as.numeric(myMatrix2[ myMatrix2$begin <= x["begin"] & myMatrix2$finish >= x["finish"], c("sed","boreWidth") ])
  names(x)[4:5] <- c("sed","boreWidth")
  return( x )
}

> t(apply( myMatrix1, 1, glommer, myMatrix2=myMatrix2))
     sampleIDs begin finish   sed boreWidth
[1,]  19990224     4      5 1.002     0.014
[2,]  20000224     5      6 1.002     0.014

I used apply and stored everything as numeric. Other approaches would be to return a data.frame and have the sampleIDs and begin, finish be ints. That might avoid some problems with floating point error.

This solution assumes there are no boundary cases (e.g. the begin, finish times of myMatrix1 are entirely contained within the begin, finish times of the other). If your data is more complicated, just change the glommer() function. How you want to handle that is a substantive question.

edited May 23 '17 at 12:27

Community

1
1

answered Nov 27 '11 at 13:11

Ari B. Friedman

71,271
35
175
235

I can't move the first line in the loop outside the loop because I need to capture those rows of `myMatrix1$begin` that have a value that is both greater than the value in the ith row of `myMatrix2$begin` and less than the value in the ith row of `myMatrix2$end`. I'll try to update with a brief summary of my data. Btw, since I'm new to R and have had to dive straight in, I do not know what you mean by vectorization. From what I understand, I should vectorize instead of loop if there is no dependency on a previous iteration. However, there is with my data: hence I'm using `cbind` – Kaleb Nov 27 '11 at 13:30
Ah, I missed the `i` in that line. Vectorization would let you avoid the loop. Whether or not it would make it faster depends how its implemented. Perhaps you could describe what you're going for and what your data looks like a little bit? It would be easy to just copy your code and make little snippets faster, while missing the bigger picture of what you're trying to accomplish (potentially only in a line or two) if we don't know what you're aiming for. – Ari B. Friedman Nov 27 '11 at 13:35
Update posted with a summary of the data that I'm using – Kaleb Nov 27 '11 at 14:19
Are the ranges of `begin` and `finish` in myMatrix1 always neatly contained within the ranges in myMatrix2? That would make this easier. – Ari B. Friedman Nov 27 '11 at 14:43
Also can you put in the results of `dput(head(myMatrix#))` so we don't have to re-type in your sample data? – Ari B. Friedman Nov 27 '11 at 14:44
re: whether the ranges of `begin` and `finish` are always neatly contained within the ranges in myMatrix2, I presume so but cannot be certain. I suppose that's part of the difficulty. Just running the `dput` function now. – Kaleb Nov 27 '11 at 15:26
Something is very wrong with your `dput` output. Both of them return errors. – Ari B. Friedman Nov 27 '11 at 15:59
your `apply` solution is great. However, I need to store the "sedimentation" and "boreWidth" dataframes separately. Furthermore, the bore width is meant to be transformed into one of `TRUE` or `FALSE` or `NA` depending upon its size. How can I incorporate this into your `glommer` function? Thanks. – Kaleb Nov 27 '11 at 16:22
You don't need to incorporate those into the apply/glommer solution. Just do them after the fact. – Ari B. Friedman Nov 27 '11 at 17:38
Ok. Thanks a lot: your explanations were very clear and detailed. I think its working now, so I'll mark this solved. – Kaleb Nov 27 '11 at 18:04
General solutions moved here: http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941 – Ari B. Friedman Dec 12 '11 at 13:17

Make Dataframe loop code run faster

1 Answers1