How to assign unique id to each records sequence?

Question

For example I have the following table:

time    machine visitor
11:30   A       123
11:31   A       123
11:33   A       123
11:34   A       256
11:35   A       256
11:36   A       256
11:37   A       256
11:38   A       789
11:40   A       789
11:42   A       789
11:50   A       123
11:51   A       123

And as a result I would like the following session id to be added to each record:

time    machine visitor session
11:30   A       123     1
11:31   A       123     1
11:33   A       123     1
11:34   A       256     2
11:35   A       256     2
11:36   A       256     2
11:37   A       256     2
11:38   A       789     3
11:40   A       789     3
11:42   A       789     3
11:50   A       123     4
11:51   A       123     4

I wrote a loop that is supposed to do that but it's way too slow:

session = 1
for (i in 2:nrow(df)) {
  if(df[i, ]$visitor != df[i-1, ]$visitor)
  {
    session = session + 1
  }
  df[i, ]$session = session
}

@Cath no, because the visitor can use a machine multiple times, making them separate sessions — souser, Jan 14 '16 at 12:57
I thought there was something like that ;-) you should state that clearly in your question and in your example data.frame — Cath, Jan 14 '16 at 12:59
@david-arenburg I don't think it's a duplicate since two records with the same visitor and machine can have different session number (if they are separated by another machine usage) — souser, Jan 14 '16 at 13:23
@souser Even that would be a dupe. BTW, have you tried my solutions? — akrun, Jan 14 '16 at 13:29
@akrun Yes, I checked them but they seem to assign the same session to usages with the same visitor number but separated by another visitor machine usage. — souser, Jan 14 '16 at 13:53
I don't think so, I changed to group by .(visitor, machine). Also, the other solutions work too. — akrun, Jan 14 '16 at 13:55
@akrun Well here's an example (session2 is your solution): https://imgur.com/FhFznrs — souser, Jan 14 '16 at 14:09
Ok, my bad, 'setDT(df1)[, session := rleid(visitor)]' works ok — souser, Jan 14 '16 at 14:15

akrun · Answer 1 · 2016-01-14T13:03:18.417

We can try

library(data.table)
setDT(df1)[, session:= .GRP, .(visitor, machine)]

Or without any grouping, using base R

v1 <- do.call(paste, df1[c('visitor', 'machine')])
df1$session <- match(v1, unique(v1))

If the 'visitor' repeats after some breaks, and we can 'session' to be different

setDT(df1)[, session := rleid(visitor)]
df1
#    time machine visitor session
#1: 11:30       A     123       1
#2: 11:31       A     123       1
#3: 11:33       A     123       1
#4: 11:34       A     256       2
#5: 11:35       A     256       2
#6: 11:36       A     256       2
#7: 11:37       A     256       2
#8: 11:38       A     789       3
#9: 11:40       A     789       3
#10:11:42       A     789       3

score 0 · Accepted Answer · answered Jan 14 '16 at 13:01

0

Probably not the most legible way to do this but you can do the following:

df$session <- cumsum(c(TRUE,as.logical(diff(df$visitor))))

To break it down a little:

> diff(df$visitor) #Difference between values in each row.
[1]   0   0 133   0   0   0 533   0   0
> c(TRUE,as.logical(diff(df$visitor))) #Converts to logical and add a lag:
 [1] TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
> cumsum(c(TRUE,as.logical(diff(df$visitor)))) #Then cumulative sum.
[1] 1 1 1 2 2 2 2 3 3 3

answered Jan 14 '16 at 13:01

plannapus

18,529
4
72
94

Thanks! I thought about diff function, but didn't know about cumsum. – souser Jan 14 '16 at 13:19
1

Correct me if I'm wrong but I think this wouldn't work as expected if the visitor remained the same in two consecutive rows but the machine would change. – talat Jan 14 '16 at 13:28
@docendodiscimus Indeed: since it wasn t mentioned in the question, i based my answer on the OP's first attempt. This does exactly the same operation than the OP's loop, but more efficiently. – plannapus Jan 14 '16 at 14:03

How to assign unique id to each records sequence?

2 Answers2