This kind of thing has been asked before, but not quite in this way that I can find.
Thread about creating sequential IDs, with several additional links
It's not hard to create identifiers in a sequence, but my data includes a temporal element that has thrown me for a loop. The following data is an imaginary dataset just to illustrate the problem in something tractable:
dput(walking_dat)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street = structure(c(4L, 3L, 3L,
5L, 3L, 4L, 6L, 7L, 4L, 4L, 1L, 2L, 1L), .Label = c("12thAve",
"14thAve", "Dupont", "Hennepin", "Lyndale", "Marquette", "Nicolette"
), class = "factor"), sequence = c(1, 2, 3, 4, 5, 1, 2, 3, 4,
5, 1, 2, 3), visit = c(1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2)), .Names = c("neighborhood",
"street", "sequence", "visit"), row.names = c(NA, -13L), class = "data.frame")
neighborhood street sequence visit
1 Uptown Hennepin 1 1
2 Uptown Dupont 2 1
3 Uptown Dupont 3 1
4 Uptown Lyndale 4 1
5 Uptown Dupont 5 2
6 Downtown Hennepin 1 1
7 Downtown Marquette 2 1
8 Downtown Nicolette 3 1
9 Downtown Hennepin 4 2
10 Downtown Hennepin 5 2
11 Dinkytown 12thAve 1 1
12 Dinkytown 14thAve 2 1
13 Dinkytown 12thAve 3 2
All data are, for the sake of imagination, from three individuals walking east in three neighborhoods of Minneapolis. Each row represents a time when their location was recorded. The first column is the neighborhood they are walking through. The second column is the intersection where they were located at each time-point. The third column is the sequence that these data occurred.
I want to create the visit
column that records sequential time-points at the same street, in the same neighborhood, as a single visit, and later return visits as the next visit. How do I create this sort of sequential identifier?
I was thinking this ave()
with FUN=seq_along
trick might work, but I can't find a way of combining the factors that gets me where I want to be.
Create a sequential number (counter) for rows within each group of a dataframe [duplicate]
Update: Uwe's solution works, but is broken if someone decides to stay at one intersection for all the measurements, which is what happened when I tried to put this to real data. If this happens, then the original number of rows are not returned to the final data.table. See what happens here:
dput(walking_dat_2)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street2 = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve",
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1,
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit_2 = c(1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 1, 1, 1)), .Names = c("neighborhood", "street2",
"sequence", "visit_2"), row.names = c(NA, -13L), class = "data.frame")
neighborhood street2 sequence visit_2
1 Uptown Hennepin 1 1
2 Uptown Hennepin 2 1
3 Uptown Hennepin 3 1
4 Uptown Hennepin 4 1
5 Uptown Hennepin 5 1
6 Downtown Hennepin 1 1
7 Downtown Marquette 2 1
8 Downtown Nicolette 3 1
9 Downtown Hennepin 4 2
10 Downtown Hennepin 5 2
11 Dinkytown 12thAve 1 1
12 Dinkytown 12thAve 2 1
13 Dinkytown 12thAve 3 1
in this case, running Uwe's solution returns only 6 rows.
library(data.table)
setDT(walking_dat)[, visit_2 := rleid(neighborhood, street2)][
, unique(.SD, by = "visit_2")][
, visit_2 := rowid(neighborhood, street2)][
walking_dat, on = .(neighborhood, street2, sequence), roll = TRUE, visit_2 := x.visit_2][]
neighborhood street2 sequence visit visit_2
1: Uptown Hennepin 1 1 1
2: Downtown Hennepin 1 2 1
3: Downtown Marquette 2 3 1
4: Downtown Nicolette 3 4 1
5: Downtown Hennepin 4 5 2
6: Dinkytown 12thAve 1 6 1