Edit / merge rows if consecutive factor with R

Question

I have this sample

    data <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L), .Label = c("a", "b"), class = "factor"), minTime = structure(c(1L, 
2L, 3L, 4L, 5L, 7L, 6L, 8L, 9L, 10L), .Label = c("2014-06-06 07:39:50", 
"2014-06-07 02:24:32", "2014-06-07 15:14:29", "2014-06-07 2:29", 
"2014-06-08 5:40", "2014-06-18 17:54:42", "2014-06-18 2:45", 
"2014-06-19 02:37:53", "2014-06-19 19:15", "2014-06-19 22:15"
), class = "factor"), maxTime = structure(c(1L, 3L, 4L, 2L, 5L, 
6L, 7L, 8L, 9L, 10L), .Label = c("2014-06-07 01:41:31", "2014-06-07 10:01", 
"2014-06-07 14:44:08", "2014-06-07 22:31:02", "2014-06-08 5:50", 
"2014-06-18 2:50", "2014-06-19 01:49:05", "2014-06-19 18:51:36", 
"2014-06-19 20:15", "2014-06-19 23:15"), class = "factor"), duration.minutes = c(NA, 
740L, 437L, 452L, NA, NA, 474L, 974L, 4062L, 353L), event = structure(c(1L, 
4L, 4L, 2L, 2L, 1L, 4L, 4L, 3L, 4L), .Label = c("enter", "exit", 
"stop", "trip"), class = "factor")), .Names = c("id", "minTime", 
"maxTime", "duration.minutes", "event"), class = "data.frame", row.names = c(NA, 
-10L))

I would like to merge the events "trips" if they are consecutive for each id.

Where the consecutives trips for the id a and b have been merged:

maxTime was edited to have the full lenght of the trip
duration.minutes was also edited.

For example, this output:

dput(output.wanted)
structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 2L), .Label = c("a", 
"b"), class = "factor"), minTime = structure(c(5L, 6L, 7L, 2L, 
1L, 3L, 4L), .Label = c("18.6.2014 17:54", "18.6.2014 2:45", 
"19.6.2014 19:15", "19.6.2014 22:15", "6.6.2014 7:39", "7.6.2014 2:24", 
"7.6.2014 2:29"), class = "factor"), maxTime = structure(c(5L, 
7L, 6L, 1L, 2L, 3L, 4L), .Label = c("18.6.2014 2:50", "19.6.2014 18:51", 
"19.6.2014 20:15", "19.6.2014 23:15", "7.6.2014 1:41", "7.6.2014 10:01", 
"7.6.2014 22:31"), class = "factor"), duration.minutes = c(NA, 
1177L, 452L, NA, 1448L, 4062L, 353L), event = structure(c(1L, 
4L, 2L, 1L, 4L, 3L, 4L), .Label = c("enter", "exit", "stop", 
"trip"), class = "factor")), .Names = c("id", "minTime", "maxTime", 
"duration.minutes", "event"), class = "data.frame", row.names = c(NA, 
-7L))

The point is that I am looking to merge all event (trip or stop) if they are consecutive for the same id. but if the stop or the trip is unique, it stays the same

I have been trying to do it with group_by and mutate but I am a bit lost...

-- I found the solution but there is still a minor issue with the events stops

Here is a sample:

> dput(total)
structure(list(Ship = c(205482000, 205482000, 205482000, 205482000, 
205482000, 205482000, 205482000, 205482000, 205482000, 205482000, 
205482000, 205482000), minTime = structure(c(1401570241, 1401969219, 
1401981860, 1402052108, 1402768362, 1402772602, 1402841443, 1402855773, 
1403056361, 1403278916, 1403290856, 1403367735), class = c("POSIXct", 
"POSIXt")), maxTime = structure(c(1401966639, 1401980399, 1402051849, 
1402056430, 1402772313, 1402839873, 1402852433, 1403052550, 1403276355, 
1403289496, 1403367596, 1403371285), class = c("POSIXct", "POSIXt"
)), duration.minutes = structure(c(6607, 186, 1166, NA, NA, 1121, 
183, 3280, 3667, 176, 1279, NA), class = "difftime", units = "mins"), 
    event = c("stop", "trip", "trip", "exit", "enter", "trip", 
    "trip", "stop", "stop", "trip", "trip", "exit"), dist.sailed = c(NA, 
    50254.2034817555, 349194.108518887, NA, NA, 347816.081064252, 
    50035.8859874946, NA, NA, 49982.687612038, 351978.737678528, 
    NA)), .Names = c("Ship", "minTime", "maxTime", "duration.minutes", 
"event", "dist.sailed"), class = "data.frame", row.names = c(4L, 
5L, 6L, 2L, 1L, 7L, 8L, 9L, 10L, 11L, 12L, 3L))

the following code is not producing the complete duration of the stop, but only NA:

 total <- total %>% 
    group_by(Ship) %>% 
    mutate(new_id = data.table::rleid(event)) %>% 
    group_by(event, new_id, Ship) %>% 
    mutate(duration.minutes = ifelse(event == 'trip', sum(duration.minutes), duration.minutes), maxTime = tail(maxTime, 1))%>% 
    mutate(duration.minutes = ifelse(event == 'stop', sum(duration.minutes), duration.minutes), maxTime = tail(maxTime, 1))%>% 
    mutate(dist.sailed = ifelse(event == 'trip', sum(dist.sailed), dist.sailed), dist.sailed = tail(dist.sailed, 1)) %>% 
    filter(!duplicated(duration.minutes)) %>% 
    select(-new_id)

I added mutate(duration.minutes = ifelse(event == 'stop', sum(duration.minutes), duration.minutes), maxTime = tail(maxTime, 1))%>% to the code from @Sotos)

Please supply a minimal example or do we need to understand all col/rows? — Christoph, Jul 15 '16 at 08:14
It is easier to understand all col/rows: there are events (trip, stop, enter or exit). minTime is the beginning of the vent and maxTime the end. — Floni, Jul 15 '16 at 08:17
isn't this question similar to this one? http://stackoverflow.com/questions/38054518/how-to-merge-rows-with-consecutive-datetime/38055667#38055667 except you'd need to group by id and event? — ArunK, Jul 15 '16 at 08:32
@theArun, yes it is and I try to work with this one and I added a comment there. The solution they found does not work... — Floni, Jul 15 '16 at 08:33
@ Floni. It does, I wrote the solution. I've just tested it and it still works :) — ArunK, Jul 15 '16 at 08:35
@theArun, I don´t have anything in subdf1... I used `subdf1 <- data%>% group_by(id,event) %>% mutate(minTime_lead = lead(minTime),maxTime_lead = lead(maxTime)) %>% filter(maxTime==minTime_lead) %>% mutate(maxTime = maxTime_lead) %>% select(-minTime_lead,-maxTime_lead) %>% ungroup()` — Floni, Jul 15 '16 at 09:23
That's cause in your particular case you aren't looking for a situation where in `maxTime == minTime_lead`. @Sotos has got a good answer. — ArunK, Jul 15 '16 at 10:11

Sotos · Accepted Answer · 2016-07-20T06:18:59.603

2

This gets a bit messy,

library(dplyr)
data %>% 
   group_by(id) %>% 
   mutate(new_id = data.table::rleid(event)) %>% 
   group_by(event, new_id, id) %>% 
   mutate(duration.minutes = ifelse(event == 'trip', sum(duration.minutes), duration.minutes), maxTime = tail(maxTime, 1)) %>% 
   filter(!duplicated(duration.minutes)) %>% 
   select(-new_id) 

#new_id     id             minTime             maxTime duration.minutes  event
#   <int> <fctr>              <fctr>              <fctr>            <int> <fctr>
#1      1      a 2014-06-06 07:39:50 2014-06-07 01:41:31               NA  enter
#2      2      a 2014-06-07 02:24:32 2014-06-07 22:31:02             1177   trip
#3      3      a     2014-06-07 2:29     2014-06-08 5:50              452   exit
#4      3      a     2014-06-08 5:40     2014-06-08 5:50               NA   exit
#5      1      b     2014-06-18 2:45     2014-06-18 2:50               NA  enter
#6      2      b 2014-06-18 17:54:42 2014-06-19 18:51:36             1448   trip
#7      3      b    2014-06-19 19:15    2014-06-19 20:15             4062   stop
#8      4      b    2014-06-19 22:15    2014-06-19 23:15              353   trip

If we run the same code for stop instead of trip we get the following results

#new_id      Ship             minTime             maxTime duration.minutes event dist.sailed
#    <int>     <dbl>              <time>              <time>            <dbl> <chr>       <dbl>
#1       1 205482000 2014-06-01 00:04:01 2014-06-05 14:10:39             6607  stop          NA
#2       2 205482000 2014-06-05 14:53:39 2014-06-06 13:50:49              186  trip    50254.20
#3       2 205482000 2014-06-05 18:24:20 2014-06-06 13:50:49             1166  trip   349194.11
#4       3 205482000 2014-06-06 13:55:08 2014-06-06 15:07:10               NA  exit          NA
#5       4 205482000 2014-06-14 20:52:42 2014-06-14 21:58:33               NA enter          NA
#6       5 205482000 2014-06-14 22:03:22 2014-06-15 20:13:53             1121  trip   347816.08
#7       5 205482000 2014-06-15 17:10:43 2014-06-15 20:13:53              183  trip    50035.89
#8       6 205482000 2014-06-15 21:09:33 2014-06-20 17:59:15             6947  stop          NA
#9       7 205482000 2014-06-20 18:41:56 2014-06-21 19:19:56              176  trip    49982.69
#10      7 205482000 2014-06-20 22:00:56 2014-06-21 19:19:56             1279  trip   351978.74
#11      8 205482000 2014-06-21 19:22:15 2014-06-21 20:21:25               NA  exit          NA

edited Jul 20 '16 at 06:18

answered Jul 15 '16 at 09:12

Sotos

51,121
6
32
66

thanks a lot. I tried on a bigger sample (here: https://www.dropbox.com/sh/bcl5k8t6x5n1rgg/AAB80VppmvV0EttmTRX9-tbya?dl=0) and it does not work that much... it is changing lot of the data! – Floni Jul 15 '16 at 10:54
@Floni what do you mean changing lot of the data? (Cant access your full dataset - work restrictions) – Sotos Jul 15 '16 at 11:02
It is doing a lot of things (group by, mutate) but I can not understand what it is doing! It looks very odd. – Floni Jul 15 '16 at 11:31
@Floni I added the output from your example data. If this is what it should be produced from your sample data, then you should update and include any cases that don't work... – Sotos Jul 15 '16 at 11:45
I edite the example data 2 hours ago because of an issue with the dates format. Now the sample is good, and the solution you gave is not worrking completely. It is mixing the id a and b – Floni Jul 15 '16 at 11:53
So what you see as output at my answer is not correct? Can you please add the expected output then? - Note that I also edited my answer 15 mins ago – Sotos Jul 15 '16 at 11:56
thank you. I now got the error `Error: unexpected ',' in: " group_by(event, new_id, id) %>% mutate(duration.minutes = ifelse(event == 'trip', sum(duration.minutes), duration.minutes)),"` and `Error in duplicated(duration.minutes) : object 'duration.minutes' not found` – Floni Jul 18 '16 at 06:46
1

@Floni sorry...typo...fixed. – Sotos Jul 18 '16 at 06:56
1

Thanks, it works perfectly! I will try the same for the events `stops`! – Floni Jul 18 '16 at 07:46
the command `mutate(duration.minutes = ifelse(event == 'trip', sum(duration.minutes), duration.minutes), maxTime = tail(maxTime, 1))` is producing the duration.minutes for the event trip. But if I want to do it also for the stops, it does not calculate the durations.minutes, only NA! – Floni Jul 19 '16 at 12:18
So, `mutate(duration.minutes = ifelse(event == 'stop', sum(duration.minutes), duration.minutes), maxTime = tail(maxTime, 1))` produces only NA? – Sotos Jul 19 '16 at 12:24
yes, it does not calculate the duration of the event, it only generates NA. but with something like `total$duration.minutes[total$event== "stop"] <- total$maxTime - total$minTime` after your command it could be fine! There will be warning messages but it could work – Floni Jul 19 '16 at 12:26
hmm... I Just run it with `stop` and it works fine for me – Sotos Jul 19 '16 at 12:33
I edited my first post - you can at the end of it the new issue with the stops! – Floni Jul 19 '16 at 12:40
Actually `total$duration.minutes[total$event== "stop"] <- total$maxTime - total$minTime` gives a non-sense result... – Floni Jul 19 '16 at 13:07
@Floni your final sample has all `NA` for `stop` – Sotos Jul 19 '16 at 14:05
for the column `duration.minutes`? It does not look like this here. And for example if I apply the code I posted for the `stops` then the merging of the rows 9 and 10 are producing NA for `duration.minutes` but it should be the sum of both these stops: 3280+3667. – Floni Jul 20 '16 at 06:09
@Floni it works for me. (Had to change the `group_by(id)` to `group_by(Ship)` since you changed the column but it works fine...I updated the results... – Sotos Jul 20 '16 at 06:17
Yes, but then it is not doing it anymore for the trips but only for the stops. I am looking for something to merge both stops and trips. I posted a new code yesterday evening on my very first port where there is also a new line to sum up the dist.sailed column. – Floni Jul 20 '16 at 06:48
ahhh...ok...you can use in one line `mutate(duration.minutes = ifelse(event %in% c('trip', 'stop'), sum(duration.minutes), duration.minutes), maxTime = tail(maxTime, 1))` – Sotos Jul 20 '16 at 06:53
Still not:(. I still have a NA in duration.minutes for the 2 stops merged between the 15th and 20th of June. – Floni Jul 20 '16 at 07:07
1

@Floni, I dunno... It works on my machine... I can't really help you any further since I can't reproduce your error... :/ – Sotos Jul 20 '16 at 07:12
ok... could it be because I have some other packages opened and they are in conflicts? Really odd! – Floni Jul 20 '16 at 07:18
1

I don't think so. I also have a lot of packages loaded but still works. – Sotos Jul 20 '16 at 07:22
1

ok. Thank you anyway! I think I found the solution by running twice the script: one for the trips (producing the duration.minutes and the dist.sailed columns) and a second one only for the stop (duration.miutes). I will test on the final dataset. – Floni Jul 20 '16 at 07:45
there is a problem with my computer... when I use the script with another computer, it works well... when I do `libaray(dplyr)`, I got: `Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union` . Filter is masked, maybe the reason? – Floni Jul 20 '16 at 08:41
1

If it is I never heard it before. Maybe you ought to ask a new question about this...? – Sotos Jul 20 '16 at 08:52
1

RStudio and the packages needed a major update. Now it works perfectly. Thanks again! – Floni Jul 20 '16 at 12:19

score 0 · Answer 2 · answered Jul 15 '16 at 08:30

0

if you use the dplyr package, you have to use summarise, not mutate to combine data in some way. Like

some.df %>% 
group_by(event) %>% 
summarise(sum.maxTime = sum(maxTime), sum.duration = sum(duration.minutes))

Mutate adds or change a column in your data.frame, but the result will always have the same number of rows as the original. The row number after summarise will be that of the number of groups in your dataset.

Hope this helps

answered Jul 15 '16 at 08:30

Ulrik

1,575
2
10
10

Yes, I was almost there. But I need to edit this only if the events "trip" or "stop" are consecutive for one id, not all the time. – Floni Jul 15 '16 at 08:35

Edit / merge rows if consecutive factor with R

2 Answers2