R - Creating a unique sequence ID for rows in a data frame based on a value in a preceding row

Question

I'm working with a travel data set that gives information for discrete segments of each individual's trip, including the distance and purpose of the segment (e.g. to go to work, to go shopping, to return home). It's organized in a dataframe. I want to group these segments into what I'm referring to as "chained" or "looped" trips, increasing the chain/loop number each time there is a return home in the preceding segment.

In r, I've tried to construct a for loop with an ifelse contained in it that refers to variable values from the previous iteration and then fills in a new column that I already created with null values. (I know that sounds a little obtuse). Basically, I was trying to work out something similar to what I was able to do in Excel where I constructed a series of nested ifs. (=IF(DF2=DF1,IF(DG1=11,DI1+1,DI1),1). This is the formula from the third row, which refers to values in the immediately preceding row.

This is the dput code output for the data:

structure(list(h_id = c(1000002L, 1000002L, 1000002L, 1000002L, 
1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L), p_ID = c(10000022L, 10000022L, 10000022L, 
10000022L, 10000131L, 10000131L, 10000132L,10000132L, 10000132L,10000132L,10000132L,10000132L), t_pur = c(6L, 11L, 7L, 11L, 
5L, 11L, 1L, 2L, 2L, 11L, 6L, 11L), t_distance = c(753.154936, 753.154936, 4681.630497, 
4681.630497, 616.0517311, 616.0517311, 9626,7984, 641.3675, 15076.6182, 21407.5585, 24273.3116, 24273.3116), X = c(1L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), Conc = c("10000022-1", "10000022-1", "10000022-2", "10000022-2", 
"10000131-1", "10000131-1", "10000132-1", "10000132-1", "10000132-1", "10000132-1", "10000132-2", "10000132-2" ), t_mode1 = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 
1L), has_work = c(0, 0, 0, 0, 0, 0,1, 0, 0, 0, 0, 0), newcol = c(1, 1, 1, 1, 1, 
1, 1, 1, 1, 1,1, 1)), .Names = c("h_id", "p_ID", "t_pur", "t_distance", "X", "Conc", 
"t_mode1", "has_work", "newcol"), row.names = 3:14, class = "data.frame")

for (i in 1:nrow(loops4)) {
  ifelse(i == 1, loops4$newcol[i] <- 1,
         ifelse(loops4$p_ID[i-1]==loops4$p_ID,
                ifelse(loops4$t_pur[i-1]==11,
                       loops4$newcol[i] <- loops4$newcol[i-1]+1, 
                       loops4$newcol[i-1]), loops4$newcol[i] <- 1)) 
}

Here, loops4 is my dataframe. newcol is the one I created to hold the loop identifier. p_ID is the unique identifier associated with each individual, t_pur is the purpose of the trip segment and "11" is the value that corresponds to a purpose of "returning home". (I want to start a new loop identifier each time the preceding segment was return home). After that, I can concatenate the unique person ID and the loop ID to create a unique identifier for each loop. The first ifelse is just for the first record to assign a 1 to the loop identifier since there would be no preceding value for the loop to consult

I expected the loop to run through each row in the dataframe, checking first if the record referred to the same person as in the preceding record. If so, it should check to see if the trip purpose from the preceding row was "11" or not. If it is, it should add 1 to the preceding identifier to signify a new loop. If it's not preceded by a trip purpose of "11", it should assign the exact same loop identifier as the preceding row and move onto the next row. When it runs, first, it seems to take an enormous amount of time, and second, it fills everything with a 1, rather than incrementing and restarting as I expected.

I expected a dataframe like this. X is the properly calculated value from Excel. newcol is the column where I was trying to calculate the value in r in the newcol. The newcol values should be the same as X, but they're not. (I've updated the table below to reflect in newcol what I had hoped to see in the output).

   h_id     p_ID t_pur t_distance X       Conc t_mode1 has_work newcol
1000002 10000022     6      753.2 1 10000022-1       1        0      1
1000002 10000022    11      753.2 1 10000022-1       1        0      1
1000002 10000022     7     4681.6 2 10000022-2       1        0      2
1000002 10000022    11     4681.6 2 10000022-2       1        0      2
1000013 10000131     5      616.1 1 10000131-1       1        0      1
1000013 10000131    11      616.1 1 10000131-1       1        0      1
1000013 10000132     1     9626.8 1 10000132-1       1        1      1
1000013 10000132     2      641.4 1 10000132-1       1        0      1
1000013 10000132     2    15076.6 1 10000132-1       1        0      1
1000013 10000132    11    21407.6 1 10000132-1       1        0      1
1000013 10000132     6    24273.3 2 10000132-2       1        0      2
1000013 10000132    11    24273.3 2 10000132-2       1        0      2

UPDATE:

I went back and gave some thought to the assignment inside the ifelse construct based on the comment below and realized that didn't make much sense. So I tried rewriting the code, as follows:

for (i in 1:nrow(loops4)) {
  loops4$newcol[i] <- ifelse(i == 1, 1, ifelse (loops4$p_ID[i-1]==loops4$p_ID[i], ifelse(loops4$t_pur[i-1]==11, loops4$newcol[i-1]+1, loops$newcol[i-1], 1)))
}

But I got the same unexpected results.

UPDATED UPDATE:

There is maybe an error in my dput data from before. I had manually added a few values. I've pasted the new dput data below.

structure(list(h_id = c(1000002L, 1000002L, 1000002L, 1000002L, 
1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 
1000013L), p_ID = c(10000022L, 10000022L, 10000022L, 10000022L, 
10000131L, 10000131L, 10000132L, 10000132L, 10000132L, 10000132L, 
10000132L, 10000132L), t_pur = c(6L, 11L, 7L, 11L, 5L, 11L, 1L, 
2L, 2L, 11L, 6L, 11L), t_distance = c(753.154936, 753.154936, 
4681.630497, 4681.630497, 616.0517311, 616.0517311, 9626.798385, 
641.3674532, 15076.61817, 21407.55851, 24273.31161, 24273.31161
), X = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), Conc = c("10000022-1", 
"10000022-1", "10000022-2", "10000022-2", "10000131-1", "10000131-1", 
"10000132-1", "10000132-1", "10000132-1", "10000132-1", "10000132-2", 
"10000132-2"), t_mode1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), has_work = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), 
    newcol = c(1L, 1L, 2L, 2L, 1L, 1L, 0L, 0L, 1L, 1L, 2L, 2L
    )), .Names = c("h_id", "p_ID", "t_pur", "t_distance", "X", 
"Conc", "t_mode1", "has_work", "newcol"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -12L))

And I've tried the dplyr approach helpfully suggested below.

loops_good <- loops5 %>%
  group_by(h_id) %>% group_by (p_ID) %>%
  mutate(newcol = cumsum(lead(t_pur, default = 0) == 11)) %>%
  ungroup()

The results I get are almost what I'd been hoping for. But not quite. Rows 7 through 10 should all be grouped together with the same identifier in newcol because there is no intervening "11". The t_pur sequence is 1, 2, 2, 11. But in the output below, the newcol values come in as 0, 0, 1, 1.

Basically, I'm trying to link together individual segments into overall trips, starting over each time there's a return home signified by a t_pur == "11". Sometimes, it's just an out-and-back (two segments). But sometimes, as in rows 7-10, it's 4 segments.

Actual output:

      h_id     p_ID t_pur t_distance     X Conc       t_mode1 has_work newcol
     <int>    <int> <int>      <dbl> <int> <chr>        <int>    <dbl>  <int>
 1 1000002 10000022     6       753.     1 10000022-1       1        0      1
 2 1000002 10000022    11       753.     1 10000022-1       1        0      1
 3 1000002 10000022     7      4682.     2 10000022-2       1        0      2
 4 1000002 10000022    11      4682.     2 10000022-2       1        0      2
 5 1000013 10000131     5       616.     1 10000131-1       1        0      1
 6 1000013 10000131    11       616.     1 10000131-1       1        0      1
 7 1000013 10000132     1      9627.     1 10000132-1       1        1      0
 8 1000013 10000132     2       641.     1 10000132-1       1        0      0
 9 1000013 10000132     2     15077.     1 10000132-1       1        0      1
10 1000013 10000132    11     21408.     1 10000132-1       1        0      1
11 1000013 10000132     6     24273.     2 10000132-2       1        0      2
12 1000013 10000132    11     24273.     2 10000132-2       1        0      2

Hoped for output:

      h_id     p_ID t_pur t_distance     X Conc       t_mode1 has_work newcol
     <int>    <int> <int>      <dbl> <int> <chr>        <int>    <dbl>  <int>
 1 1000002 10000022     6       753.     1 10000022-1       1        0      1
 2 1000002 10000022    11       753.     1 10000022-1       1        0      1
 3 1000002 10000022     7      4682.     2 10000022-2       1        0      2
 4 1000002 10000022    11      4682.     2 10000022-2       1        0      2
 5 1000013 10000131     5       616.     1 10000131-1       1        0      1
 6 1000013 10000131    11       616.     1 10000131-1       1        0      1
 7 1000013 10000132     1      9627.     1 10000132-1       1        1      1
 8 1000013 10000132     2       641.     1 10000132-1       1        0      1
 9 1000013 10000132     2     15077.     1 10000132-1       1        0      1
10 1000013 10000132    11     21408.     1 10000132-1       1        0      1
11 1000013 10000132     6     24273.     2 10000132-2       1        0      2
12 1000013 10000132    11     24273.     2 10000132-2       1        0      2

I cannot imagine a situation where assignment inside `ifelse` (single or nested) makes sense to me. Consider making this question *reproducible*, including unambiguous sample data (`dput(head(loops4))`) and expected output. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. — r2evans, May 24 '19 at 16:43
Thanks, r2evans, I'm adding the output from the (dput(head(loops4)) code you mentioned. I'll also edit the question to see if I can show what I was hoping to get as the output. — James DeWeese, May 24 '19 at 17:30
You are trying to calculate `newcol`, but they are all 1 (which is unhelpful). Also, your sample input has 6 rows but your "expected output" has 12. If `newcol` is incorrect and that is why you are asking this question, then *please put in what you **expect***, so that we can fix the code. — r2evans, May 24 '19 at 17:48
Your row-10 "11" is not triggering a change, is that intentional? — r2evans, May 24 '19 at 19:01

r2evans · Accepted Answer · 2019-05-24T21:05:53.610

0

I think I understand what you want ... here's a stab.

Explanation:

when you say "checking first if the record referred to the same person", that tells me you should be grouping by that variable, ergo dplyr::group_by, data.table's by=, and base R's by()
using a simple lead or shift, we include logic of the next row's value in order to assign it to this row; b/c both of those functions do not know what to do when looking at the last row for a particular person, we have to provide the default value with default=, fill=, or manually with c(...[-1], 0)

Edited for updated logic and understanding.

`dplyr`

library(dplyr)
x %>%
  group_by(p_ID) %>%
  mutate(newcol = cumsum(lag(t_pur == 11, default = TRUE))) %>%
  ungroup()
# # A tibble: 12 x 9
#       h_id     p_ID t_pur t_distance     X Conc       t_mode1 has_work newcol
#      <int>    <int> <int>      <dbl> <int> <chr>        <int>    <dbl>  <int>
#  1 1000002 10000022     6       753.     1 10000022-1       1        0      1
#  2 1000002 10000022    11       753.     1 10000022-1       1        0      1
#  3 1000002 10000022     7      4682.     2 10000022-2       1        0      2
#  4 1000002 10000022    11      4682.     2 10000022-2       1        0      2
#  5 1000013 10000131     5       616.     1 10000131-1       1        0      1
#  6 1000013 10000131    11       616.     1 10000131-1       1        0      1
#  7 1000013 10000132     1      9627.     1 10000132-1       1        1      1
#  8 1000013 10000132     2       641.     1 10000132-1       1        0      1
#  9 1000013 10000132     2     15077.     1 10000132-1       1        0      1
# 10 1000013 10000132    11     21408.     1 10000132-1       1        0      1
# 11 1000013 10000132     6     24273.     2 10000132-2       1        0      2
# 12 1000013 10000132    11     24273.     2 10000132-2       1        0      2

`data.table`

library(data.table)
xDT <- x
xDT$newcol <- NULL
setDT(xDT)
xDT[, newcol := cumsum(shift(t_pur == 11, type = "lag", fill = TRUE)), by = "p_ID"]

Base R

do.call(rbind.data.frame,
        c(by(x, x$p_ID, function(z)
          within(z, { newcol = cumsum(c(TRUE, head(z$t_pur, n=-1) == 11)) } )),
          stringsAsFactors = FALSE))

edited May 24 '19 at 21:05

answered May 24 '19 at 18:09

r2evans

141,215
6
77
149

r2evans, thanks, this is incredibly helpful. I made some adjustments, adding the missing data to the set to ensure there were a matching number of rows. I ran the code you suggested with dplyr. It mostly worked. (I'm enough of a newbie that I don't really understand it, but I'm working my way through the help files and tutorials to pick apart how it works.) When I ran it with all 12 observations, the count ended up resetting to zero. I've pasted that above. But I think that should work. – James DeWeese May 24 '19 at 19:18
The count will start at 0 if the first two rows don't make the match. If that's a problem, you can always add `newcol = ifelse(min(newcol) == 0, 1, 0) + newcol` (or a shortcut: `newcol = newcol = (min(newcol) == 0)`) as a second argument to the `mutate` call. This can be adapted to the `data.table` and base calls as well, if that's needed. – r2evans May 24 '19 at 20:06
Thanks again, r2evans, I don't mind sequences setting to zero since that will still give me a unique "chain" identifier for each set of trips when I concatenate it with the person id (p_ID). One thing I don't quite understand is why the code doesn't pick up on the fact that that the segments in rows 7 through 10 are connected (since there's no "11" return home. Instead it goes from 0 to 1 between rows 8 and 9. I'm still teaching myself the dplyr to understand how this code works, but do you know why it doesn't recognize the rows from 7-10 as a four-segment chain? – James DeWeese May 24 '19 at 20:56
Yes, because your explanation was that a block started on the row prior to an "11". I think what you mean is that it starts a new block immediately after "11", is that right? – r2evans May 24 '19 at 21:00
See my edits, I think it's what you're looking for this time. – r2evans May 24 '19 at 21:06
Oh, I see where I created the confusion. The block starts on the row after the "11". (To figure that out in Excel, I used nested ifs to evaluate the value of the immediately preceding row to see if was an "11" and, if so, then I started the segment in the current row.) I'm still not sure I understand why it would transition from o to 1 for newcol between rows 8 and 9 when there's no 11 in t_pur there. – James DeWeese May 24 '19 at 21:09
1

Sorry, I got called away and then had pick-up duty. I'm going to sit down with the code again this evening when I'm back in front of the computer! Thanks again for all your help. – James DeWeese May 24 '19 at 23:18
That's exactly it! It works perfectly. Thanks a million! I'd been banging my head against the wall (or keyboard). Purely as a matter of learning/curiosity since you've solved my immediate data problem, do you think it would have been possible for me to eventually hit on a makeshift solution using nested-if conditions to mimic what I'd done in Excel, or would it have actually been impossible? – James DeWeese May 25 '19 at 11:25
Certainly possible. Using grouping versus `ifelse` is as much efficiency as readability and such, but it can work your way too. Two things, though: if I'm nesting more than two `ifelse`s, I typically shift away and use other methods instead (e.g., grouping aggregation, `merge`); and there are times when even one `ifelse` is not safe (e.g., `POSIXt`). – r2evans May 25 '19 at 16:26

R - Creating a unique sequence ID for rows in a data frame based on a value in a preceding row

1 Answers1

dplyr

data.table

Base R

`dplyr`

`data.table`