
I am iterating along a POSIX sequence to identify the number of concurrent events at a given time with exactly the method described in this question and the corresponding answer:

How to count the number of concurrent users using time interval data?

My problem is that my `tinterval` sequence in minutes covers a whole year, which means it has 523,025 entries. In addition, I am also thinking about a resolution in seconds, which would make things even worse.
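In essence, the counting step looks like this (a simplified sketch, not the exact code from the linked answer; `tdata` holds the intervals as POSIXct `start` and `end` columns):

# one entry per minute of the year
tinterval <- seq(min(tdata$start), max(tdata$end), by = "min")

# for every minute, count how many intervals cover it
concurrent <- sapply(tinterval, function(x) sum(tdata$start <= x & tdata$end >= x))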

Is there anything I can do to improve this code (e.g. does the order of the date intervals in the input data (`tdata`) matter?), or do I have to accept the performance if I want a solution in R?

fr3d-5
  • It seems that you could gain some speed by changing `tinterval`, `tdata$start` and `tdata$end` to "numeric" and applying the suggested (in the linked QA) solution on these. – alexis_laz Oct 15 '14 at 15:08
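A sketch of what that comment suggests, i.e. dropping the POSIXct class and counting on plain numbers (not alexis_laz's actual code):

# seconds since the epoch as plain doubles; comparisons then avoid POSIXct method dispatch
tint_num  <- as.numeric(tinterval)
start_num <- as.numeric(tdata$start)
end_num   <- as.numeric(tdata$end)

concurrent <- sapply(tint_num, function(x) sum(start_num <= x & end_num >= x))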

3 Answers


You could try using data.table's new `foverlaps` function. With the data from the other question:

library(data.table)
setDT(tdata)
setkey(tdata, start, end)

# one row per minute of the observation period, each interval spanning 59 seconds
minutes <- data.table(start = seq(trunc(min(tdata[["start"]]), "mins"), 
                                  round(max(tdata[["end"]]), "mins"), by="min"))
minutes[, end := start+59]
setkey(minutes, start, end)

# overlap join: one row per (event interval, minute) overlap
DT <- foverlaps(tdata, minutes, type="any")

# number of overlapping intervals per minute
counts <- DT[, .N, by=start]
plot(N~start, data=counts, type="s")

resulting plot

I haven't timed this on huge data. Try it yourself.
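For example, to time the overlap join on your own data (base R `system.time`, nothing extra needed):

system.time({
    DT <- foverlaps(tdata, minutes, type="any")
    counts <- DT[, .N, by=start]
})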

Roland

Here is another approach that should be faster than processing a list. It relies on data.table joins and on lubridate for binning times to the nearest minute. It also assumes that there were 0 users before you started recording them, but this can be fixed by adding a constant to `concurrent` at the end:

library(data.table)
library(lubridate)

td <- data.table(start=floor_date(tdata$start, "minute"),
                 end=ceiling_date(tdata$end, "minute"))

# create vector of all minutes from start to end
# about 530K for a whole year
time.grid <- seq(from=min(td$start), to=max(td$end), by="min")
users <- data.table(time=time.grid, key="time")

# match users on starting time and 
# sum matches by start time to count multiple loging in same minute
setkey(td, start)
users <- td[users, 
          list(started=!is.na(end)), 
          nomatch=NA, 
          allow.cartesian=TRUE][, list(started=sum(started)), 
                                by=start]

# match users on ending time, essentially the same procedure
setkey(td, end)
users <- td[users, 
            list(started, ended=!is.na(start)), 
            nomatch=NA, 
            allow.cartesian=TRUE][, list(started=sum(started), 
                                         ended=sum(ended)), 
                                  by=end]

# fix timestamp column name
setnames(users, "end", "time")

# here you can exclude all entries where both counts are zero
# for a sparse representation
users <- users[started > 0 | ended > 0]

# last step, take difference of cumulative sums to get concurrent users
users[, concurrent := cumsum(started) - cumsum(ended)]

Each of the two complex-looking joins can be split into two steps (first the join, then the summary by minute), but I recall reading that the chained form is more efficient. If not, splitting them would make the operations more legible.
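If you do split them, the first join would look roughly like this (a sketch; it produces the same per-minute `started` counts as the chained call above):

# first the join, then the per-minute summary, as two separate steps
setkey(td, start)
joined <- td[users, nomatch=NA, allow.cartesian=TRUE]
users  <- joined[, list(started=sum(!is.na(end))), by=start]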

ilir

R is an interpreted language, which means that every time you ask it to execute a command, it has to interpret your code first and then execute it. For a `for` loop this means that in each iteration it has to "re-interpret" your code, which is, of course, very slow. There are three common ways that I am aware of which help solve this:

  1. R is vector-oriented, so loops are most likely not a good way to use it. If possible, you should try to rethink your logic here and vectorise the approach.
  2. Use a just-in-time compiler (the `compiler` package that ships with R).
  3. (What I ended up doing) Use Rcpp to translate your loop-heavy code into C/C++. This can easily give you a speed boost of a thousand times; see the sketch below.
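For illustration, a minimal sketch of option 3 (not the code I actually ended up with): a small C++ counting function compiled from R with Rcpp's `cppFunction`, assuming the start times, end times and time grid are passed as numeric vectors.

library(Rcpp)

cppFunction('
IntegerVector count_concurrent(NumericVector start, NumericVector end,
                               NumericVector grid) {
  int nevents = start.size(), ntimes = grid.size();
  IntegerVector counts(ntimes);
  for (int i = 0; i < ntimes; ++i) {
    int n = 0;
    for (int j = 0; j < nevents; ++j) {
      if (start[j] <= grid[i] && end[j] >= grid[i]) n++;
    }
    counts[i] = n;
  }
  return counts;
}')

# convert POSIXct to numeric before handing the data to the compiled function
concurrent <- count_concurrent(as.numeric(tdata$start),
                               as.numeric(tdata$end),
                               as.numeric(tinterval))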