I have data on dates of visits and personal ids:

library(data.table)

n <- 1e6
set.seed(42L)
# One row per visit: a person id and the date of the visit
DT <- data.table(id = sample(1:37000, n, replace=TRUE),
                 date = as.Date("1963-07-13", "%Y-%m-%d")
                 - sample(1:9000, n, replace=TRUE))

I'm adding a variable that ranks the visits for each person. Visit #1, #2, etc. If I can't differentiate between two visits they can be ranked in any (consistent) way.

After my last question (on efficiency) I realised I should learn how to use data.table, so my current solution uses it. The only problem is that the command takes a few seconds to run.

> system.time(DT[, visit.n := rank(date, ties.method="first"), by = id]
+ )
   user  system elapsed 
   4.42    0.02    4.44 

I wonder if I'm doing something "wrong" or just need to be patient and move on.

s_baldur
  • Try `setkey(DT, date); system.time(DT[, visit.n := 1:.N, by=id])` – Martin Schmelzer Jul 27 '16 at 15:51
  • Assuming you like your jumbled dates as they are, you can just use `order(date)` in `i` to sort only for the purpose of making the new col. And if you care about performance, you might consider integer storage formats for dates, so `system.time(DT[, date := as.IDate(date)][order(date), visit.n := 1:.N])` I see this taking ~ half the time of Martin's setkey. Also note that an author of the package says "In most cases therefore, there shouldn't be a need to set keys anymore." http://stackoverflow.com/a/20057411/ – Frank Jul 27 '16 at 17:32
  • Interesting. But your line of code does not yield the desired output, right? When adding `by=id` (so the output is correct), the performance effort again doubles... – Martin Schmelzer Jul 27 '16 at 18:12
  • With `by=id` it does seem to work fine, and using `as.IDate()` indeed cuts the computing time in half! – s_baldur Jul 27 '16 at 19:43
  • @Martin D'oh. Yeah, I initially had `rowid(id)` before realizing that was devel-only, and then edited wrongly to `1:.N` instead of `1:.N, by=id` – Frank Jul 28 '16 at 14:08

1 Answer

Taken from my comment:

As @Frank pointed out, setkey is not necessary. Just using order(date) is sufficient to rank the dates. I also incorporated his point of saving the dates as integers.

system.time({
  DT[, date := as.IDate(date)][order(date), visit.n := 1:.N, by=id]
}) 

   user      system     elapsed
  0.126       0.005       0.132 
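
To check that the order-based version gives the same numbering as the original `rank()` call, something like the following sketch could be used. The column names `visit.rank` and `visit.ord` and the smaller `n` are assumptions made only for this check. Ties are expected to match here because both approaches keep tied dates in their original row order, though the question allows any consistent tie-breaking.

```r
library(data.table)

set.seed(42L)
n  <- 1e5  # assumption: smaller than the question's 1e6, to keep the check quick
DT <- data.table(id   = sample(1:37000, n, replace = TRUE),
                 date = as.Date("1963-07-13") - sample(1:9000, n, replace = TRUE))

# Reference: the rank()-based approach from the question
DT[, visit.rank := rank(date, ties.method = "first"), by = id]

# Order-based approach: store dates as integers, sort by date only for the
# assignment, then number the rows within each id
DT[, date := as.IDate(date)]
DT[order(date), visit.ord := seq_len(.N), by = id]

# Stops with an error if the two numberings differ anywhere
stopifnot(DT[, all(visit.rank == visit.ord)])
```

`seq_len(.N)` is used instead of `1:.N` purely as a defensive habit; for non-empty groups the two are equivalent.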
Martin Schmelzer
  • Great improvement! But I also tried 'setkey(DT, id)' and that was even faster. – s_baldur Jul 27 '16 at 15:56
  • The setkey belongs inside the timing; the `1:.N` is now available in a `rowid(id)` function; and setkey is rarely necessary (see comment above). Just fyi. Oh, I guess rowid is still only in the devel version. – Frank Jul 27 '16 at 17:34
  • how does `setorder(DT, date)[, visit.n := 1:.N, by = id]` compare? – SymbolixAU Jul 28 '16 at 02:58
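
The variants mentioned in this thread (`setkey`, `order(date)` in `i`, and `setorder`) can be compared side by side with a sketch along these lines. The object names `DT0`, `A`, `B`, `C` and the `t_*` variables are assumptions for illustration. No timings are quoted here, since they depend on the machine and the data.table version.

```r
library(data.table)

set.seed(42L)
n   <- 1e6
DT0 <- data.table(id   = sample(1:37000, n, replace = TRUE),
                  date = as.IDate(as.Date("1963-07-13") -
                                  sample(1:9000, n, replace = TRUE)))

# Each variant gets its own copy, because setkey() and setorder()
# physically reorder the table by reference.
A <- copy(DT0); B <- copy(DT0); C <- copy(DT0)

# The sort is timed together with the assignment, as suggested in the comments.
t_setkey   <- system.time(setkey(A, date)[, visit.n := seq_len(.N), by = id])
t_order    <- system.time(B[order(date), visit.n := seq_len(.N), by = id])
t_setorder <- system.time(setorder(C, date)[, visit.n := seq_len(.N), by = id])

# user / system / elapsed for each variant
print(rbind(setkey = t_setkey, order = t_order, setorder = t_setorder)[, 1:3])
```

Note that only the `order(date)` variant leaves the rows of the table in their original order; the other two sort the table itself.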