0

How to assign unique id to each records sequence?

For example I have the following table:

time    machine visitor
11:30   A       123
11:31   A       123
11:33   A       123
11:34   A       256
11:35   A       256
11:36   A       256
11:37   A       256
11:38   A       789
11:40   A       789
11:42   A       789
11:50   A       123
11:51   A       123

And as a result I would like the following session id to be added to each record:

time    machine visitor session
11:30   A       123     1
11:31   A       123     1
11:33   A       123     1
11:34   A       256     2
11:35   A       256     2
11:36   A       256     2
11:37   A       256     2
11:38   A       789     3
11:40   A       789     3
11:42   A       789     3
11:50   A       123     4
11:51   A       123     4

I wrote a loop that is supposed to do that but it's way too slow:

session = 1
for (i in 2:nrow(df)) {
  if(df[i, ]$visitor != df[i-1, ]$visitor)
  {
    session = session + 1
  }
  df[i, ]$session = session
}
souser
  • 3
  • 2
  • 1
    wouldn't the variable `visitor` do the job ? – Cath Jan 14 '16 at 12:51
  • @Cath no, because the visitor can use a machine multiple times, making them separate sessions – souser Jan 14 '16 at 12:57
  • 1
    I thought there was something like that ;-) you should state that clearly in your question and in your example data.frame – Cath Jan 14 '16 at 12:59
  • @david-arenburg I don't think it's a duplicate since two records with the same visitor and machine can have different session number (if they are separated by another machine usage) – souser Jan 14 '16 at 13:23
  • @souser Even that would be a dupe. BTW, have you tried my solutions? – akrun Jan 14 '16 at 13:29
  • @akrun Yes, I checked them but they seem to assign the same session to usages with the same visitor number but separated by another visitor machine usage. – souser Jan 14 '16 at 13:53
  • I don't think so, I changed to group by .(visitor, machine). Also, the other solutions work too. – akrun Jan 14 '16 at 13:55
  • @akrun Well here's an example (session2 is your solution): https://imgur.com/FhFznrs – souser Jan 14 '16 at 14:09
  • 1
    Ok, my bad, 'setDT(df1)[, session := rleid(visitor)]' works ok – souser Jan 14 '16 at 14:15

2 Answers2

1

We can try

library(data.table)
setDT(df1)[, session:= .GRP, .(visitor, machine)]

Or without any grouping, using base R

v1 <- do.call(paste, df1[c('visitor', 'machine')])
df1$session <- match(v1, unique(v1))

If the 'visitor' repeats after some breaks, and we can 'session' to be different

setDT(df1)[, session := rleid(visitor)]
df1
#    time machine visitor session
#1: 11:30       A     123       1
#2: 11:31       A     123       1
#3: 11:33       A     123       1
#4: 11:34       A     256       2
#5: 11:35       A     256       2
#6: 11:36       A     256       2
#7: 11:37       A     256       2
#8: 11:38       A     789       3
#9: 11:40       A     789       3
#10:11:42       A     789       3
akrun
  • 874,273
  • 37
  • 540
  • 662
0

Probably not the most legible way to do this but you can do the following:

df$session <- cumsum(c(TRUE,as.logical(diff(df$visitor))))

To break it down a little:

> diff(df$visitor) #Difference between values in each row.
[1]   0   0 133   0   0   0 533   0   0
> c(TRUE,as.logical(diff(df$visitor))) #Converts to logical and add a lag:
 [1] TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
> cumsum(c(TRUE,as.logical(diff(df$visitor)))) #Then cumulative sum.
[1] 1 1 1 2 2 2 2 3 3 3
plannapus
  • 18,529
  • 4
  • 72
  • 94
  • Thanks! I thought about diff function, but didn't know about cumsum. – souser Jan 14 '16 at 13:19
  • 1
    Correct me if I'm wrong but I think this wouldn't work as expected if the visitor remained the same in two consecutive rows but the machine would change. – talat Jan 14 '16 at 13:28
  • @docendodiscimus Indeed: since it wasn t mentioned in the question, i based my answer on the OP's first attempt. This does exactly the same operation than the OP's loop, but more efficiently. – plannapus Jan 14 '16 at 14:03