I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).
Now I want all rows with the same ID having an overlapping time span (or have a start date that is right the day after the end date of another row) to be merged together.
Merging them means that they end up in one row having the same ID, the min(start date) and the max(end date) (I hope you understand what I mean).
I have written a function for that (it is not fully tested, but it looks fine for the moment). The problem is, as my data frame has nearly 100.000 observations, the function is very slow.
Can you help me improve my function in terms of efficiency?
Here is the function
smoothingEpisodes <- function (theData) {
theOutput <- data.frame()
curId <- theData[1, "ID"]
curStart <- theData[1, "START"]
curEnd <- theData[1, "END"]
for(i in 2:nrow(theData)) {
nextId <- theData[i, "ID"]
nextStart <- theData[i, "START"]
nextEnd <- theData[i, "END"]
if (curId != nextId | (curEnd + 1) < nextStart) {
theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
theOutput
}
Thank you!
[edit]
test data:
ID START END
1 1 2000-01-01 2000-03-31
2 1 2000-04-01 2000-05-31
3 1 2000-04-15 2000-07-31
4 1 2000-09-01 2000-10-31
5 2 2000-01-15 2000-03-31
6 2 2000-02-01 2000-03-15
7 2 2000-04-01 2000-04-15
8 3 2000-06-01 2000-06-15
9 3 2000-07-01 2000-07-15
(START and END have data type "Date", ID is a numeric)
A dput of the data:
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957,
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"),
END = structure(c(11047, 11108, 11169, 11261, 11047, 11031,
11062, 11123, 11153), class = "Date")), .Names = c("ID",
"START", "END"), class = "data.frame", row.names = c(NA, 9L))