Creating unique groups in sequential data that repeats through time

Question

This kind of thing has been asked before, but not quite in this way that I can find.

Thread about creating sequential IDs, with several additional links

It's not hard to create identifiers in a sequence, but my data includes a temporal element that has thrown me for a loop. The following data is an imaginary dataset just to illustrate the problem in something tractable:

    dput(walking_dat)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
"Uptown"), class = "factor"), street = structure(c(4L, 3L, 3L, 
5L, 3L, 4L, 6L, 7L, 4L, 4L, 1L, 2L, 1L), .Label = c("12thAve", 
"14thAve", "Dupont", "Hennepin", "Lyndale", "Marquette", "Nicolette"
), class = "factor"), sequence = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 
5, 1, 2, 3), visit = c(1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2)), .Names = c("neighborhood", 
"street", "sequence", "visit"), row.names = c(NA, -13L), class = "data.frame")

   neighborhood    street sequence visit
1        Uptown  Hennepin        1     1
2        Uptown    Dupont        2     1
3        Uptown    Dupont        3     1
4        Uptown   Lyndale        4     1
5        Uptown    Dupont        5     2
6      Downtown  Hennepin        1     1
7      Downtown Marquette        2     1
8      Downtown Nicolette        3     1
9      Downtown  Hennepin        4     2
10     Downtown  Hennepin        5     2
11    Dinkytown   12thAve        1     1
12    Dinkytown   14thAve        2     1
13    Dinkytown   12thAve        3     2

All data are, for the sake of imagination, from three individuals walking east in three neighborhoods of Minneapolis. Each row represents a time when their location was recorded. The first column is the neighborhood they are walking through. The second column is the intersection where they were located at each time-point. The third column is the sequence that these data occurred.

I want to create the visit column that records sequential time-points at the same street, in the same neighborhood, as a single visit, and later return visits as the next visit. How do I create this sort of sequential identifier?

I was thinking this ave() with FUN=seq_along trick might work, but I can't find a way of combining the factors that gets me where I want to be.

Create a sequential number (counter) for rows within each group of a dataframe [duplicate]

Update: Uwe's solution works, but is broken if someone decides to stay at one intersection for all the measurements, which is what happened when I tried to put this to real data. If this happens, then the original number of rows are not returned to the final data.table. See what happens here:

dput(walking_dat_2)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
"Uptown"), class = "factor"), street2 = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve", 
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1, 
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit_2 = c(1, 1, 1, 1, 
1, 1, 1, 1, 2, 2, 1, 1, 1)), .Names = c("neighborhood", "street2", 
"sequence", "visit_2"), row.names = c(NA, -13L), class = "data.frame")

   neighborhood   street2 sequence visit_2
1        Uptown  Hennepin        1       1
2        Uptown  Hennepin        2       1
3        Uptown  Hennepin        3       1
4        Uptown  Hennepin        4       1
5        Uptown  Hennepin        5       1
6      Downtown  Hennepin        1       1
7      Downtown Marquette        2       1
8      Downtown Nicolette        3       1
9      Downtown  Hennepin        4       2
10     Downtown  Hennepin        5       2
11    Dinkytown   12thAve        1       1
12    Dinkytown   12thAve        2       1
13    Dinkytown   12thAve        3       1

in this case, running Uwe's solution returns only 6 rows.

library(data.table)
setDT(walking_dat)[, visit_2 := rleid(neighborhood, street2)][
     , unique(.SD, by = "visit_2")][
         , visit_2 := rowid(neighborhood, street2)][
             walking_dat, on = .(neighborhood, street2, sequence), roll = TRUE, visit_2 := x.visit_2][]

   neighborhood   street2 sequence visit visit_2
1:       Uptown  Hennepin        1     1       1
2:     Downtown  Hennepin        1     2       1
3:     Downtown Marquette        2     3       1
4:     Downtown Nicolette        3     4       1
5:     Downtown  Hennepin        4     5       2
6:    Dinkytown   12thAve        1     6       1

In your example `Uptown Dupont` has `visit` values of 1, 1, 2. Shouldn't that be 1, 2, 3? — Hack-R, Sep 30 '17 at 22:30
No, the idea is that if the person is recorded there twice in a row it's assumed that that is one visit. Once they are recorded somewhere else then if they come back to the same intersection it is their second visit. — JHegg, Sep 30 '17 at 22:36
That is exactly the problem I'm having though, dealing with those repeated time-points where a person is staying in the same location. The `visit` column in my example is correct for what I am looking for. — JHegg, Sep 30 '17 at 22:39

Hack-R · Answer 1 · 2017-09-30T22:44:49.573

1

# Not required, but convenient:
walking_dat$combo <- paste(walking_dat$neighborhood, walking_dat$street)

# Place holder:
walking_dat$visit <- NA

# Create it:
for(i in 1:nrow(walking_dat)){
  if(i %in% row.names(walking_dat[with(walking_dat, c(TRUE, diff(as.numeric(interaction(neighborhood, street))) != 0)), ])){
    walking_dat$visit[i] <- sum(walking_dat$combo[with(walking_dat, c(TRUE, diff(as.numeric(interaction(neighborhood, street))) != 0))][1:i]==walking_dat$combo[i], na.rm=T)
  } else{
    walking_dat$visit[i] <- 1
  }
}

walking_dat

   neighborhood    street sequence visit              combo
1        Uptown  Hennepin        1     1    Uptown Hennepin
2        Uptown    Dupont        2     1      Uptown Dupont
3        Uptown    Dupont        3     1      Uptown Dupont
4        Uptown   Lyndale        4     1     Uptown Lyndale
5        Uptown    Dupont        5     2      Uptown Dupont
6      Downtown  Hennepin        1     1  Downtown Hennepin
7      Downtown Marquette        2     1 Downtown Marquette
8      Downtown Nicolette        3     1 Downtown Nicolette
9      Downtown  Hennepin        4     2  Downtown Hennepin
10     Downtown  Hennepin        5     1  Downtown Hennepin
11    Dinkytown   12thAve        1     2  Dinkytown 12thAve
12    Dinkytown   14thAve        2     1  Dinkytown 14thAve
13    Dinkytown   12thAve        3     2  Dinkytown 12thAve

edited Sep 30 '17 at 22:44

answered Sep 30 '17 at 22:30

Hack-R

22,422
14
75
131

Dang! It's going to take me a minute to digest how that for loop is doing it's thing. – JHegg Sep 30 '17 at 22:52
I can help. It's pretty simple once you understand that the indexing with the `with` is just creating a view of the data with no consecutive duplicate rows. So it's saying that, ignoring consecutive duplicate rows, how many times have we seen this combination of `neighborhood` and `street`. – Hack-R Sep 30 '17 at 23:00
This explains the indexing: https://stackoverflow.com/questions/14056153/remove-consecutive-duplicates-from-dataframe . If this answers your question please alert future readers and give me credit by checking the green checkmark by the answer :) – Hack-R Sep 30 '17 at 23:00
That doesn't quite do it. It works from row 1:9, but fails after that. – JHegg Sep 30 '17 at 23:08
1

Please, can you double check row 10. Shouldn't this visit to Downtown / Hennepin have the same visit number 2 as the preceeding row 9? – Uwe Sep 30 '17 at 23:08
1

I'm giving an up vote for a good effort that didn't quite get there. – JHegg Oct 01 '17 at 00:10
Thanks yes I can fix that in the else statement but seems @Uwe already solved the problem to your satisfaction and I have a fever so I am just going to let it go – Hack-R Oct 01 '17 at 00:55

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

The difficulty here is that subsequent recordings to the same street in the same neighborhood should be counted as one visit. This requires to collapse these rows into one, count the visits to different neighborhoods & streets and finally expand this to the original number of rows.

Note that column visit containing the expected result is not overwritten but kept for comparison with the computed visit_new column.

library(data.table)
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][
      walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]

    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         1
 3:       Uptown    Dupont        3     1         1
 4:       Uptown   Lyndale        4     1         1
 5:       Uptown    Dupont        5     2         2
 6:     Downtown  Hennepin        1     1         1
 7:     Downtown Marquette        2     1         1
 8:     Downtown Nicolette        3     1         1
 9:     Downtown  Hennepin        4     2         2
10:     Downtown  Hennepin        5     2         2
11:    Dinkytown   12thAve        1     1         1
12:    Dinkytown   14thAve        2     1         1
13:    Dinkytown   12thAve        3     2         2

Explanation step by step

DF is coerced to data.table. The rleid() function creates unique numbers for changes in neighborhood & street.

 setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][]

    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         2
 3:       Uptown    Dupont        3     1         2
 4:       Uptown   Lyndale        4     1         3
 5:       Uptown    Dupont        5     2         4
 6:     Downtown  Hennepin        1     1         5
 7:     Downtown Marquette        2     1         6
 8:     Downtown Nicolette        3     1         7
 9:     Downtown  Hennepin        4     2         8
10:     Downtown  Hennepin        5     2         8
11:    Dinkytown   12thAve        1     1         9
12:    Dinkytown   14thAve        2     1        10
13:    Dinkytown   12thAve        3     2        11

Note that rows 2 & 3 are repeated as well as rows 9 & 10. The duplicates are removed in the next step which creates a new, temporary data.table object:

setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][]

    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         2
 3:       Uptown   Lyndale        4     1         3
 4:       Uptown    Dupont        5     2         4
 5:     Downtown  Hennepin        1     1         5
 6:     Downtown Marquette        2     1         6
 7:     Downtown Nicolette        3     1         7
 8:     Downtown  Hennepin        4     2         8
 9:    Dinkytown   12thAve        1     1         9
10:    Dinkytown   14thAve        2     1        10
11:    Dinkytown   12thAve        3     2        11

Now, we can number the visits to distinct neighborhoods and streets using the rowid() function:

setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][]

    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         1
 3:       Uptown   Lyndale        4     1         1
 4:       Uptown    Dupont        5     2         2
 5:     Downtown  Hennepin        1     1         1
 6:     Downtown Marquette        2     1         1
 7:     Downtown Nicolette        3     1         1
 8:     Downtown  Hennepin        4     2         2
 9:    Dinkytown   12thAve        1     1         1
10:    Dinkytown   14thAve        2     1         1
11:    Dinkytown   12thAve        3     2         2

Finally, we need to expand the result to the original number of rows again. This is accomplished by a rolling join of the temporary data.table with the original DF (all rows included):

setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][
      walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]

Perhaps, it might be worth to note that visit_new is used and re-used to hold temporary data through various stages until the final update.

New data set

The fixed code works also with the second data set provided by the OP:

walking_dat_2 <-
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
"Uptown"), class = "factor"), street = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve", 
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1, 
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit = c(1, 1, 1, 1, 1, 
1, 1, 1, 2, 2, 1, 1, 1), visit_new = c(1L, 1L, 1L, 1L, 1L, 2L, 
3L, 4L, 5L, 5L, 6L, 6L, 6L)), .Names = c("neighborhood", "street", 
"sequence", "visit", "visit_new"), row.names = c(NA, -13L), class = "data.frame")

setDT(walking_dat_2)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][
      walking_dat_2, on = .(neighborhood, street, sequence), 
      roll = TRUE, .SD]

    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown  Hennepin        2     1         1
 3:       Uptown  Hennepin        3     1         1
 4:       Uptown  Hennepin        4     1         1
 5:       Uptown  Hennepin        5     1         1
 6:     Downtown  Hennepin        1     1         1
 7:     Downtown Marquette        2     1         1
 8:     Downtown Nicolette        3     1         1
 9:     Downtown  Hennepin        4     2         2
10:     Downtown  Hennepin        5     2         2
11:    Dinkytown   12thAve        1     1         1
12:    Dinkytown   12thAve        2     1         1
13:    Dinkytown   12thAve        3     1         1

I think you misinterpreted the question. I am wanting to know how to create the `visit` column from the other data. (Nevermind, your edit clarifies what you're trying to do). — JHegg, Sep 30 '17 at 22:42
Please, see the `visit_new` column which is computed from the other columns. I have left your expected `visit`column just for comparison. — Uwe, Sep 30 '17 at 22:44
Yes, this seems to work. You were correct about `visit` in row 12 and 13, those values have been corrected. I'm going to have to really dig in to figure out what is being done here, I'm not very familiar with the `data.table` package. — JHegg, Sep 30 '17 at 23:29
you might want to add back in `library(data.table)` or future visitors might be really confused. — JHegg, Sep 30 '17 at 23:39
Excellent explanation, I was about halfway through Googling these steps and your explanation made it crystal clear. Bingo! — JHegg, Oct 01 '17 at 00:08
I had to un-click the solution, I found something that breaks this. If someone sits in one intersection for every measurement...the rolling join only returns 1 row. See my updated question. — JHegg, Oct 01 '17 at 04:41
Thanks for the update. This works perfectly with my bigger dataset as well. Thanks for all of your work in making this work and providing a detailed explanation. — JHegg, Oct 01 '17 at 18:49

Creating unique groups in sequential data that repeats through time

2 Answers2

Explanation step by step

New data set

Linked