I'm working with a travel data set that gives information for discrete segments of each individual's trip, including the distance and purpose of the segment (e.g. to go to work, to go shopping, to return home). It's organized in a dataframe. I want to group these segments into what I'm referring to as "chained" or "looped" trips, increasing the chain/loop number each time there is a return home in the preceding segment.
In R, I've tried to construct a for loop containing an ifelse that refers to variable values from the previous iteration and then fills in a new column that I had already created with null values. (I know that sounds a little obtuse.) Basically, I was trying to work out something similar to what I was able to do in Excel, where I constructed a series of nested IFs: =IF(DF2=DF1,IF(DG1=11,DI1+1,DI1),1). This is the formula from the third row, which refers to values in the immediately preceding row.
This is the dput code output for the data:
structure(list(h_id = c(1000002L, 1000002L, 1000002L, 1000002L,
1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L), p_ID = c(10000022L, 10000022L, 10000022L,
10000022L, 10000131L, 10000131L, 10000132L,10000132L, 10000132L,10000132L,10000132L,10000132L), t_pur = c(6L, 11L, 7L, 11L,
5L, 11L, 1L, 2L, 2L, 11L, 6L, 11L), t_distance = c(753.154936, 753.154936, 4681.630497,
4681.630497, 616.0517311, 616.0517311, 9626.7984, 641.3675, 15076.6182, 21407.5585, 24273.3116, 24273.3116), X = c(1L, 1L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), Conc = c("10000022-1", "10000022-1", "10000022-2", "10000022-2",
"10000131-1", "10000131-1", "10000132-1", "10000132-1", "10000132-1", "10000132-1", "10000132-2", "10000132-2" ), t_mode1 = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L,
1L), has_work = c(0, 0, 0, 0, 0, 0,1, 0, 0, 0, 0, 0), newcol = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1,1, 1)), .Names = c("h_id", "p_ID", "t_pur", "t_distance", "X", "Conc",
"t_mode1", "has_work", "newcol"), row.names = 3:14, class = "data.frame")
for (i in 1:nrow(loops4)) {
ifelse(i == 1, loops4$newcol[i] <- 1,
ifelse(loops4$p_ID[i-1]==loops4$p_ID,
ifelse(loops4$t_pur[i-1]==11,
loops4$newcol[i] <- loops4$newcol[i-1]+1,
loops4$newcol[i-1]), loops4$newcol[i] <- 1))
}
Here, loops4 is my dataframe. newcol is the column I created to hold the loop identifier. p_ID is the unique identifier associated with each individual, t_pur is the purpose of the trip segment, and "11" is the value that corresponds to a purpose of "returning home". (I want to start a new loop identifier each time the preceding segment was a return home.) After that, I can concatenate the unique person ID and the loop ID to create a unique identifier for each loop. The first ifelse is just for the first record, assigning a 1 to the loop identifier since there would be no preceding value for the loop to consult.
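To make the concatenation step concrete, here is a minimal sketch of what I have in mind. The vectors `person` and `loop_id` and the name `chain_key` are made up for illustration; the person IDs are copied from my data and the separator matches the format of my Conc column.

```r
# Hypothetical example vectors: person IDs taken from the data above,
# loop numbers as they would look once the loop identifier is computed.
person  <- c(10000022L, 10000022L, 10000132L)
loop_id <- c(1L, 2L, 1L)

# Join person ID and loop number with a hyphen, as in the Conc column.
chain_key <- paste(person, loop_id, sep = "-")
print(chain_key)
```

Because the IDs are stored as integers, paste() should write them out in full rather than in scientific notation.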
I expected the loop to run through each row in the dataframe, checking first if the record referred to the same person as in the preceding record. If so, it should check to see if the trip purpose from the preceding row was "11" or not. If it is, it should add 1 to the preceding identifier to signify a new loop. If it's not preceded by a trip purpose of "11", it should assign the exact same loop identifier as the preceding row and move onto the next row. When it runs, first, it seems to take an enormous amount of time, and second, it fills everything with a 1, rather than incrementing and restarting as I expected.
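Spelled out with plain if/else statements rather than nested ifelse calls, the logic I just described would look roughly like the sketch below. It is only an illustration of the intended behaviour, not my original code: `trips` and `loop_id` are hypothetical names, the data frame holds just the p_ID and t_pur values from my data, and it assumes the rows are already sorted so that each person's segments are contiguous and in travel order.

```r
# Only the two columns the logic needs, with values copied from the data above.
trips <- data.frame(
  p_ID  = c(10000022L, 10000022L, 10000022L, 10000022L,
            10000131L, 10000131L,
            10000132L, 10000132L, 10000132L, 10000132L, 10000132L, 10000132L),
  t_pur = c(6L, 11L, 7L, 11L, 5L, 11L, 1L, 2L, 2L, 11L, 6L, 11L)
)

loop_id <- integer(nrow(trips))
for (i in seq_len(nrow(trips))) {
  if (i == 1 || trips$p_ID[i] != trips$p_ID[i - 1]) {
    # First row overall, or first row of a new person: restart at 1.
    loop_id[i] <- 1L
  } else if (trips$t_pur[i - 1] == 11) {
    # Previous segment was a return home: start the next loop.
    loop_id[i] <- loop_id[i - 1] + 1L
  } else {
    # Same person, still in the same loop.
    loop_id[i] <- loop_id[i - 1]
  }
}
trips$loop_id <- loop_id
print(trips)
```

The `||` operator short-circuits, so the comparison with row `i - 1` is never evaluated when `i` is 1.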
I expected a dataframe like this. X is the properly calculated value from Excel. newcol is the column where I was trying to calculate the same value in R. The newcol values should be the same as X, but they're not. (I've updated the table below to show in newcol what I had hoped to see in the output.)
h_id p_ID t_pur t_distance X Conc t_mode1 has_work newcol
1000002 10000022 6 753.2 1 10000022-1 1 0 1
1000002 10000022 11 753.2 1 10000022-1 1 0 1
1000002 10000022 7 4681.6 2 10000022-2 1 0 2
1000002 10000022 11 4681.6 2 10000022-2 1 0 2
1000013 10000131 5 616.1 1 10000131-1 1 0 1
1000013 10000131 11 616.1 1 10000131-1 1 0 1
1000013 10000132 1 9626.8 1 10000132-1 1 1 1
1000013 10000132 2 641.4 1 10000132-1 1 0 1
1000013 10000132 2 15076.6 1 10000132-1 1 0 1
1000013 10000132 11 21407.6 1 10000132-1 1 0 1
1000013 10000132 6 24273.3 2 10000132-2 1 0 2
1000013 10000132 11 24273.3 2 10000132-2 1 0 2
UPDATE:
I went back and gave some thought to the assignment inside the ifelse construct based on the comment below and realized that didn't make much sense. So I tried rewriting the code, as follows:
for (i in 1:nrow(loops4)) {
loops4$newcol[i] <- ifelse(i == 1, 1, ifelse (loops4$p_ID[i-1]==loops4$p_ID[i], ifelse(loops4$t_pur[i-1]==11, loops4$newcol[i-1]+1, loops$newcol[i-1], 1)))
}
But I got the same unexpected results.
UPDATED UPDATE:
There may have been an error in my dput data from before, since I had manually added a few values. I've pasted the new dput data below.
structure(list(h_id = c(1000002L, 1000002L, 1000002L, 1000002L,
1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L, 1000013L,
1000013L), p_ID = c(10000022L, 10000022L, 10000022L, 10000022L,
10000131L, 10000131L, 10000132L, 10000132L, 10000132L, 10000132L,
10000132L, 10000132L), t_pur = c(6L, 11L, 7L, 11L, 5L, 11L, 1L,
2L, 2L, 11L, 6L, 11L), t_distance = c(753.154936, 753.154936,
4681.630497, 4681.630497, 616.0517311, 616.0517311, 9626.798385,
641.3674532, 15076.61817, 21407.55851, 24273.31161, 24273.31161
), X = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), Conc = c("10000022-1",
"10000022-1", "10000022-2", "10000022-2", "10000131-1", "10000131-1",
"10000132-1", "10000132-1", "10000132-1", "10000132-1", "10000132-2",
"10000132-2"), t_mode1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), has_work = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0),
newcol = c(1L, 1L, 2L, 2L, 1L, 1L, 0L, 0L, 1L, 1L, 2L, 2L
)), .Names = c("h_id", "p_ID", "t_pur", "t_distance", "X",
"Conc", "t_mode1", "has_work", "newcol"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -12L))
And I've tried the dplyr approach helpfully suggested below.
loops_good <- loops5 %>%
group_by(h_id) %>% group_by (p_ID) %>%
mutate(newcol = cumsum(lead(t_pur, default = 0) == 11)) %>%
ungroup()
The results I get are almost what I'd been hoping for. But not quite. Rows 7 through 10 should all be grouped together with the same identifier in newcol because there is no intervening "11". The t_pur sequence is 1, 2, 2, 11. But in the output below, the newcol values come in as 0, 0, 1, 1.
Basically, I'm trying to link together individual segments into overall trips, starting over each time there's a return home signified by a t_pur == "11". Sometimes, it's just an out-and-back (two segments). But sometimes, as in rows 7-10, it's 4 segments.
Actual output:
h_id p_ID t_pur t_distance X Conc t_mode1 has_work newcol
<int> <int> <int> <dbl> <int> <chr> <int> <dbl> <int>
1 1000002 10000022 6 753. 1 10000022-1 1 0 1
2 1000002 10000022 11 753. 1 10000022-1 1 0 1
3 1000002 10000022 7 4682. 2 10000022-2 1 0 2
4 1000002 10000022 11 4682. 2 10000022-2 1 0 2
5 1000013 10000131 5 616. 1 10000131-1 1 0 1
6 1000013 10000131 11 616. 1 10000131-1 1 0 1
7 1000013 10000132 1 9627. 1 10000132-1 1 1 0
8 1000013 10000132 2 641. 1 10000132-1 1 0 0
9 1000013 10000132 2 15077. 1 10000132-1 1 0 1
10 1000013 10000132 11 21408. 1 10000132-1 1 0 1
11 1000013 10000132 6 24273. 2 10000132-2 1 0 2
12 1000013 10000132 11 24273. 2 10000132-2 1 0 2
Hoped for output:
h_id p_ID t_pur t_distance X Conc t_mode1 has_work newcol
<int> <int> <int> <dbl> <int> <chr> <int> <dbl> <int>
1 1000002 10000022 6 753. 1 10000022-1 1 0 1
2 1000002 10000022 11 753. 1 10000022-1 1 0 1
3 1000002 10000022 7 4682. 2 10000022-2 1 0 2
4 1000002 10000022 11 4682. 2 10000022-2 1 0 2
5 1000013 10000131 5 616. 1 10000131-1 1 0 1
6 1000013 10000131 11 616. 1 10000131-1 1 0 1
7 1000013 10000132 1 9627. 1 10000132-1 1 1 1
8 1000013 10000132 2 641. 1 10000132-1 1 0 1
9 1000013 10000132 2 15077. 1 10000132-1 1 0 1
10 1000013 10000132 11 21408. 1 10000132-1 1 0 1
11 1000013 10000132 6 24273. 2 10000132-2 1 0 2
12 1000013 10000132 11 24273. 2 10000132-2 1 0 2
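For reference, the rule that produces the hoped-for column can be sketched in base R by looking at the previous segment instead of the next one: a loop starts on a person's first row and on any row whose preceding segment was a return home. This is a sketch under assumptions, not a tested solution for the full data set. `trips`, `loop_id` and `starts_loop` are hypothetical names, only p_ID and t_pur are reproduced, and the rows are assumed to be in travel order within each person.

```r
# p_ID and t_pur copied from the dput output above.
trips <- data.frame(
  p_ID  = c(10000022L, 10000022L, 10000022L, 10000022L,
            10000131L, 10000131L,
            10000132L, 10000132L, 10000132L, 10000132L, 10000132L, 10000132L),
  t_pur = c(6L, 11L, 7L, 11L, 5L, 11L, 1L, 2L, 2L, 11L, 6L, 11L)
)

# TRUE marks the first segment of a loop: the person's first row, or any
# row whose previous segment (for the same person) had purpose 11.
starts_loop <- function(pur) c(TRUE, head(pur, -1) == 11)

# ave() applies the function separately within each p_ID group;
# the running count of loop starts is the loop number.
trips$loop_id <- ave(trips$t_pur, trips$p_ID,
                     FUN = function(pur) cumsum(starts_loop(pur)))
print(trips)
```

In the dplyr pipeline above, the analogous change would presumably be to replace lead(t_pur, default = 0) with lag(t_pur, default = 11), so that each person's first row counts as the start of loop 1.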