"smoothing" time data - can it be done more efficient?

Question

I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).

Now I want to merge all rows with the same ID whose time spans overlap (or whose start date is the day right after the end date of another row).

Merging them means that they end up in one row having the same ID, the min(start date) and the max(end date) (I hope you understand what I mean).

I have written a function for that (it is not fully tested, but it looks fine for the moment). The problem is that my data frame has nearly 100,000 observations, so the function is very slow.

Can you help me improve my function in terms of efficiency?

Here is the function:

smoothingEpisodes <- function (theData) {
    theOutput <- data.frame()

    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))

    theOutput
}

Thank you!

[edit]

test data:

    ID      START        END
1    1 2000-01-01 2000-03-31
2    1 2000-04-01 2000-05-31
3    1 2000-04-15 2000-07-31
4    1 2000-09-01 2000-10-31
5    2 2000-01-15 2000-03-31
6    2 2000-02-01 2000-03-15
7    2 2000-04-01 2000-04-15
8    3 2000-06-01 2000-06-15
9    3 2000-07-01 2000-07-15

(START and END have data type "Date", ID is a numeric)

A dput of the data:

structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))

the output of `dput()` is more useful as we need the objects to be Dates. — Gavin Simpson, Jun 21 '11 at 13:37

Gavin Simpson · Accepted Answer · 2011-06-21T14:29:47.240

The first [without really thinking too hard about what you are trying to do] optimisation I would suggest is to allocate storage for theOutput. At the moment, you are growing theOutput at each iteration of the loop. In R that is an absolute no no!! That is something you never do, unless you like woefully slow code. R has to copy the object and expand it during each iteration, and that is slow.

Looking at the code, we know that theOutput needs to have nrow(theData) - 1 rows, and 3 columns. So create that before the loop starts:

theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData) - 1))

then fill in this object during the loop:

theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd)

for example.

It isn't clear what START and END are. If these are numerics, then working with a matrix rather than a data frame could also improve speed.

Also, creating a data frame each iteration is going to be slow. I can't time this without spending a lot of my own time, but you could just fill in the bits you want directly, without incurring the data.frame() call during each iteration:

theOutput[i, "ID"] <- curId
theOutput[i, "START"] <- curStart
theOutput[i, "END"] <- curEnd
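To make the cost of growing an object concrete, here is a minimal sketch on made-up data. The functions `grow` and `prealloc` and the columns `ID` and `VALUE` are hypothetical names used only for illustration, not part of the OP's code. It compares `rbind()`-ing one row per iteration against filling preallocated vectors and building the data frame once at the end.

```r
n <- 2000L

## Growing: rbind() has to copy the whole data frame on every iteration.
grow <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) {
        out <- rbind(out, data.frame(ID = i, VALUE = i * 2))
    }
    out
}

## Preallocating: fill plain vectors, build the data frame once at the end.
prealloc <- function(n) {
    id <- integer(n)
    value <- numeric(n)
    for (i in seq_len(n)) {
        id[i] <- i
        value[i] <- i * 2
    }
    data.frame(ID = id, VALUE = value)
}

t_grow <- system.time(res_grow <- grow(n))
t_pre  <- system.time(res_pre <- prealloc(n))

## Elapsed seconds for each variant; exact numbers depend on the machine.
print(t_grow["elapsed"])
print(t_pre["elapsed"])
```

Both variants should produce the same values; only the way the storage is handled differs.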

The best tip I can give you, however, is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data, large enough to give you a bit of run-time to gather useful profiling data without having to wait for ages for the profiling run to complete. To profile in R, use Rprof():

Rprof(filename = "my_fun_profile.Rprof")
## run your function call here on a subset of the data
Rprof(NULL)

Then you can look at the output using

summaryRprof("my_fun_profile.Rprof")
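As a self-contained illustration of that workflow, here is a sketch that profiles a deliberately slow toy function. `slow_fun` and `prof_file` are hypothetical names made up for this example; in practice you would call your own function on a subset of your data instead.

```r
## A toy function that grows a data frame row by row (slow on purpose).
slow_fun <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) {
        out <- rbind(out, data.frame(x = i))
    }
    out
}

## Write the profile to a temporary file so nothing is left behind.
prof_file <- tempfile(fileext = ".Rprof")

Rprof(filename = prof_file, interval = 0.01)  # start sampling
res <- slow_fun(3000)                         # code to be profiled
Rprof(NULL)                                   # stop sampling

## by.self lists the time spent in each function itself,
## by.total includes time spent in the functions it calls.
prof <- summaryRprof(prof_file)
print(head(prof$by.self))
```

The functions at the top of `by.self` are the first candidates for optimisation. I would expect `rbind` and `data.frame` related calls to dominate here, but the exact listing depends on the R version and machine.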

Hadley Wickham (@hadley) has a package to make this a bit easier. It is called profr. And as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.

Edit: as the OP provided some test data I knocked up something quick to show the speed-up achieved by just following good loop practice:

smoothingEpisodes2 <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    nr <- nrow(theData)
    out1 <- integer(length = nr)
    out2 <- out3 <- numeric(length = nr)
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out1[i-1] <- curId
            out2[i-1] <- curStart
            out3[i-1] <- curEnd
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out1[i] <- curId
    out2[i] <- curStart
    out3[i] <- curEnd
    theOutput <- data.frame(ID = out1,
                            START = as.Date(out2, origin = "1970-01-01"),
                            END = as.Date(out3, origin = "1970-01-01"))
    ## drop empty rows; logical indexing also works when no row is empty,
    ## whereas -which(...) would then drop every row
    theOutput <- theOutput[theOutput$ID != 0, ]
    theOutput
}

Using the test dataset provided, stored in the object testData, I get:

> res1 <- smoothingEpisodes(testData)
> system.time(replicate(100, smoothingEpisodes(testData)))
   user  system elapsed 
  1.091   0.000   1.131 
> res2 <- smoothingEpisodes2(testData)
> system.time(replicate(100, smoothingEpisodes2(testData)))
   user  system elapsed 
  0.506   0.004   0.517

A 50% speed-up. Not dramatic, but simple to achieve just by not growing an object at each iteration.

thanks for your tips! in fact theoutput could have 1 to nrow(theData) rows. However, you are right with the columns. If I initialize theOutput with nrow(theData), is there a way to get rid of the empty rows in the end? — speendo, Jun 21 '11 at 13:50
@Marcel Yes, the example I just posted deals with that. In it, `res1` and `res2` are equal except for the rownames. — Gavin Simpson, Jun 21 '11 at 14:00
@Dirk - good point, one which I had forgotten. My mentioning of Hadley's profr was not an endorsement. I haven't used it so wouldn't be able to give an informed opinion. — Gavin Simpson, Jun 21 '11 at 14:28

speendo · Answer 2 · 2011-06-21T15:14:39.260

I did it slightly differently to avoid deleting empty rows at the end:

smoothingEpisodes <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    theLength <- nrow(theData)

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[j] <- curId
            out.2[j] <- curStart
            out.3[j] <- curEnd

            j <- j + 1

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[j] <- curId
    out.2[j] <- curStart
    out.3[j] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

Quite a big improvement over my original version!

wanted to give the points to you :) I think anyone who reads this thread (won't be too many people I guess) will be able to scroll down. — speendo, Jun 21 '11 at 15:13
:-) I needed to write this script anyway - people like you, who care about the problems of others, are certainly more gracious. Another thing: although a function like this can be implemented in SQL, a procedural implementation (like this R implementation) is so much faster. This example shows that the programming paradigm has a big impact on runtime efficiency — speendo, Jun 21 '11 at 16:35

score 1 · Answer 3 · answered Jun 21 '11 at 23:30

Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns to integer and double vectors.

As a side note, I try to use [[ where applicable, and try to keep integers as integers by writing j <- j + 1L etc. That does not make any difference here, but sometimes coercing between integers and doubles can take quite some time.
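To illustrate why extracting the columns first matters, here is a small sketch on made-up data. The data frame `df` and its size are assumptions for the demo only. It times element-by-element access through the data frame against indexing a plain numeric vector that was extracted once.

```r
n <- 5000L
df <- data.frame(ID = seq_len(n),
                 START = as.Date("2000-01-01") + seq_len(n))

## Element-by-element access through the data frame:
## every df[i, "START"] goes through data frame and Date method dispatch.
t_df <- system.time(
    for (i in seq_len(n)) tmp <- df[i, "START"]
)

## Extract the column once as plain numbers, then index the vector.
start <- as.numeric(df[["START"]])
t_vec <- system.time(
    for (i in seq_len(n)) tmp <- start[[i]]
)

## Elapsed seconds for each variant; exact numbers depend on the machine.
print(t_df["elapsed"])
print(t_vec["elapsed"])
```

The values read are the same in both loops (days since 1970-01-01 once the Date class is stripped); only the access path changes.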

smoothingEpisodes3 <- function (theData) {
    theLength <- nrow(theData)
    if (theLength < 2L) return(theData)

    id <- as.integer(theData[["ID"]])
    start <- as.numeric(theData[["START"]])
    end <- as.numeric(theData[["END"]])

    curId <- id[[1L]]
    curStart <- start[[1L]]
    curEnd <- end[[1L]]

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1L

    for(i in 2:nrow(theData)) {
        nextId <- id[[i]]
        nextStart <- start[[i]]
        nextEnd <- end[[i]]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[[j]] <- curId
            out.2[[j]] <- curStart
            out.3[[j]] <- curEnd

            j <- j + 1L

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[[j]] <- curId
    out.2[[j]] <- curStart
    out.3[[j]] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

Then, the following code will show the speed difference. I just took your data and replicated it 1000 times...

x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))

r <- 1000
y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))

system.time( a1 <- smoothingEpisodes(y) )   # 2.95 seconds
system.time( a2 <- smoothingEpisodes3(y) )  # 0.10 seconds
all.equal( a1, a2 )

wow, just tried it - this is simply amazing! Can't believe this is just because of changing `[ ... ]` to `[[ ... ]]`. why does it make such a big difference? — speendo, Jun 22 '11 at 11:23
It's not the `[[` change that provides the speed boost - it's extracting the columns BEFORE the loop AND coercing the date columns to numeric. Extracting elements from a data.frame has quite some overhead. Extracting elements from a date vector also has quite some overhead. Together, it's a lot (the 30x difference)! — Tommy, Jun 22 '11 at 16:03
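For what it's worth, the loop can also be avoided altogether. The following is only a sketch of a vectorised alternative, not the method used in any of the answers above. `smoothingEpisodesVec` is a hypothetical name, and the code assumes the data is sorted by ID, START, END, contains no NA values, and that END is never before START in any row.

```r
smoothingEpisodesVec <- function(theData) {
    n <- nrow(theData)
    if (n < 2L) return(theData)

    id    <- theData[["ID"]]
    start <- as.numeric(theData[["START"]])
    end   <- as.numeric(theData[["END"]])

    ## Running maximum of END within each ID.
    runEnd <- ave(end, id, FUN = cummax)

    ## A row opens a new episode if the ID changes or if it starts
    ## more than one day after the latest END seen so far.
    newEp <- c(TRUE, id[-1L] != id[-n] | start[-1L] > runEnd[-n] + 1)

    first <- which(newEp)            # first row of each episode
    last  <- c(first[-1L] - 1L, n)   # last row of each episode

    data.frame(ID    = id[first],
               START = as.Date(start[first], origin = "1970-01-01"),
               END   = as.Date(runEnd[last], origin = "1970-01-01"))
}

## The test data from the question.
testData <- data.frame(
    ID    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
    START = as.Date(c("2000-01-01", "2000-04-01", "2000-04-15", "2000-09-01",
                      "2000-01-15", "2000-02-01", "2000-04-01",
                      "2000-06-01", "2000-07-01")),
    END   = as.Date(c("2000-03-31", "2000-05-31", "2000-07-31", "2000-10-31",
                      "2000-03-31", "2000-03-15", "2000-04-15",
                      "2000-06-15", "2000-07-15")))

res <- smoothingEpisodesVec(testData)
print(res)
```

On the test data this returns five rows: two episodes for ID 1, one for ID 2 and two for ID 3. Because END is never before START, the running maximum within an ID equals the maximum within the current episode, which is why a single `cummax` per ID is enough.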