Here is another approach that should be faster than processing a list. It relies on data.table joins and on lubridate for binning times to the nearest minute. It also assumes that there were 0 users before you started recording them; if that is not the case, add a constant offset to concurrent at the end:
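The code below assumes `tdata` is a table of sessions with POSIXct `start` and `end` columns. If you want something concrete to experiment with, a made-up example (the timestamps are invented) could be:

```r
library(data.table)
# hypothetical sample data: three sessions, the first two overlapping
tdata <- data.table(
  start = as.POSIXct(c("2014-01-01 10:00:30",
                       "2014-01-01 10:01:10",
                       "2014-01-01 10:05:00"), tz = "UTC"),
  end   = as.POSIXct(c("2014-01-01 10:03:20",
                       "2014-01-01 10:02:45",
                       "2014-01-01 10:06:00"), tz = "UTC")
)
```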
library(data.table)
library(lubridate)
td <- data.table(start=floor_date(tdata$start, "minute"),
                 end=ceiling_date(tdata$end, "minute"))
# create vector of all minutes from start to end
# about 530K for a whole year
time.grid <- seq(from=min(td$start), to=max(td$end), by="min")
users <- data.table(time=time.grid, key="time")
# match users on starting time and
# sum matches by start time to count multiple logins in the same minute
setkey(td, start)
# (data.table 1.9.4+ no longer groups X[Y, j] by each i row implicitly,
# so by=.EACHI is needed here)
users <- td[users,
            list(started=!is.na(end)),
            nomatch=NA,
            allow.cartesian=TRUE,
            by=.EACHI][, list(started=sum(started)),
                       by=start]
# match users on ending time, essentially the same procedure
setkey(td, end)
users <- td[users,
            list(started, ended=!is.na(start)),
            nomatch=NA,
            allow.cartesian=TRUE,
            by=.EACHI][, list(started=started[1],  # started is recycled per match, take it once
                              ended=sum(ended)),
                       by=end]
# fix timestamp column name
setnames(users, "end", "time")
# here you can exclude all entries where both counts are zero
# for a sparse representation
users <- users[started > 0 | ended > 0]
# last step, take difference of cumulative sums to get concurrent users
users[, concurrent := cumsum(started) - cumsum(ended)]
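The final step works because at any minute the number of users online equals all logins so far minus all logouts so far. A tiny hand-built per-minute table (numbers invented for illustration) shows the idea:

```r
library(data.table)
# made-up per-minute counts of sessions starting and ending
ex <- data.table(minute  = 1:5,
                 started = c(2L, 1L, 0L, 0L, 1L),
                 ended   = c(0L, 0L, 1L, 2L, 0L))
# running total of starts minus running total of ends
# = number of sessions still open at each minute
ex[, concurrent := cumsum(started) - cumsum(ended)]
ex$concurrent  # 2 3 2 0 1
```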
Each of the two complex-looking joins could be split into two steps (first the join, then the per-minute summary), but I recall reading that chaining them is more efficient. If not, splitting them would make the operations more legible.
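For reference, a split version of the first join might look like the sketch below, one operation per step. It is self-contained with its own made-up two-session `td`, so the timestamps and variable names `joined` and `per.min` are illustrative only:

```r
library(data.table)
# hypothetical sessions, already rounded to the minute
td <- data.table(start = as.POSIXct(c("2014-01-01 10:00:00",
                                      "2014-01-01 10:01:00"), tz = "UTC"),
                 end   = as.POSIXct(c("2014-01-01 10:03:00",
                                      "2014-01-01 10:02:00"), tz = "UTC"))
users <- data.table(time = seq(min(td$start), max(td$end), by = "min"),
                    key = "time")
# step 1: the join alone -- one row per minute, end is NA where no session started
setkey(td, start)
joined <- td[users, allow.cartesian = TRUE]
joined[, started := !is.na(end)]
# step 2: the summary alone -- collapse to a count of starts per minute
per.min <- joined[, list(started = sum(started)), by = start]
```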