Remove duplicates if an observation appears consecutively, order matters

Question

I have a dataframe grouped by bikeid and sorted by time. If type repeats consecutively, I want to keep the earliest time. In the case below, I want to remove line 17, 19,33,39 and 41

subtract value from previous row by group This will get what I need once I removed the duplicates.

    bikeid    type              time
1   1004    repair_time 2019-04-04 14:07:00
3   1004    red_time    2019-04-19 00:54:56
8   1004    repair_time 2019-04-19 12:47:00
10  1004    red_time    2019-04-19 16:45:18
15  1004    repair_time 2019-04-20 04:42:00
17  1004    repair_time 2019-04-20 05:29:00
19  1004    repair_time 2019-04-28 07:33:00
27  1010    repair_time 2019-04-20 10:05:00
29  1010    red_time    2019-04-22 20:51:21
33  1010    red_time    2019-04-23 11:02:34
37  1010    repair_time 2019-04-24 17:20:00
39  1010    repair_time 2019-04-24 18:30:00
41  1010    repair_time 2019-04-24 18:42:00

The final result should look this this:

   bikeid    type              time
1   1004    repair_time 2019-04-04 14:07:00
3   1004    red_time    2019-04-19 00:54:56
8   1004    repair_time 2019-04-19 12:47:00
10  1004    red_time    2019-04-19 16:45:18
15  1004    repair_time 2019-04-20 04:42:00
27  1010    repair_time 2019-04-20 10:05:00
29  1010    red_time    2019-04-22 20:51:21
37  1010    repair_time 2019-04-24 17:20:00

akrun · Accepted Answer · 2019-06-14T04:38:07.560

An option is to use rleid (from data.table) to create a grouping variable along with the second column and slice the first observation. Here, the time column is already arranged, so we don't have to do any ordering

library(dplyr)
library(data.table)
df1 %>%
    group_by(V2, grp = rleid(V3)) %>%
    slice(1) %>%
    ungroup %>%
    select(-grp)
# A tibble: 8 x 4
#     V1    V2 V3          V4                 
#  <int> <int> <chr>       <chr>              
#1     1  1004 repair_time 2019-04-04 14:07:00
#2     3  1004 red_time    2019-04-19 00:54:56
#3     8  1004 repair_time 2019-04-19 12:47:00
#4    10  1004 red_time    2019-04-19 16:45:18
#5    15  1004 repair_time 2019-04-20 04:42:00
#6    27  1010 repair_time 2019-04-20 10:05:00
#7    29  1010 red_time    2019-04-22 20:51:21
#8    37  1010 repair_time 2019-04-24 17:20:00

Or use the data.table method where we convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'V2', and rleid of 'V3', get the row index (.I) of the first observation, extract ($V1) it and subset the rows of dataset

library(data.table)
setDT(df1)[df1[, .I[1], .(V2, rleid(V3))]$V1]

data

df1 <- structure(list(V1 = c(1L, 3L, 8L, 10L, 15L, 17L, 19L, 27L, 29L, 
33L, 37L, 39L, 41L), V2 = c(1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1010L, 1010L, 1010L, 1010L, 1010L, 1010L), V3 = c("repair_time", 
"red_time", "repair_time", "red_time", "repair_time", "repair_time", 
"repair_time", "repair_time", "red_time", "red_time", "repair_time", 
"repair_time", "repair_time"), V4 = c("2019-04-04 14:07:00", 
"2019-04-19 00:54:56", "2019-04-19 12:47:00", "2019-04-19 16:45:18", 
"2019-04-20 04:42:00", "2019-04-20 05:29:00", "2019-04-28 07:33:00", 
"2019-04-20 10:05:00", "2019-04-22 20:51:21", "2019-04-23 11:02:34", 
"2019-04-24 17:20:00", "2019-04-24 18:30:00", "2019-04-24 18:42:00"
)), class = "data.frame", row.names = c(NA, -13L))

the first option worked perfectly. The second, $V1 used the rownames field, is that a typo? — Jacky, Jun 14 '19 at 05:03
@Jacky Here, `$V1` is the default column from `.I[1]` as we didn't name it. It is the row index of the first observation for each grouping variable. It is working fine for me — akrun, Jun 14 '19 at 14:56

Marius · Answer 2 · 2019-06-14T04:42:30.957

Another option using lag to check if the status is the same as the previous row. As akrun notes, this works because the data is already sorted by time:

library(dplyr)

df %>%
    group_by(bikeid) %>%
    mutate(repeated = status == lag(status)) %>%
    # Need the is.na() check as first element of each group is NA
    #   due to the lag
    filter(! repeated | is.na(repeated))

Data setup code:

txt = "1   1004    repair_time 2019-04-04 14:07:00
3   1004    red_time    2019-04-19 00:54:56
8   1004    repair_time 2019-04-19 12:47:00
10  1004    red_time    2019-04-19 16:45:18
15  1004    repair_time 2019-04-20 04:42:00
17  1004    repair_time 2019-04-20 05:29:00
19  1004    repair_time 2019-04-28 07:33:00
27  1010    repair_time 2019-04-20 10:05:00
29  1010    red_time    2019-04-22 20:51:21
33  1010    red_time    2019-04-23 11:02:34
37  1010    repair_time 2019-04-24 17:20:00
39  1010    repair_time 2019-04-24 18:30:00
41  1010    repair_time 2019-04-24 18:42:00"

df = read.table(text = txt, header = FALSE)
colnames(df) = c("row", "bikeid", "status", "date", "time")
df$date = as.POSIXct(paste(df$date, df$time))

Remove duplicates if an observation appears consecutively, order matters

2 Answers2

data