
I have a data.table indicating the position of an animal which looks something like this:

    Date       TimeStamp             Transponder Units
 1: 2021-08-15 2021-08-15 14:11:13    DA2C614E    M2
 2: 2021-08-15 2021-08-15 14:11:14    DA2C614E    M2
 3: 2021-08-15 2021-08-15 14:11:14    DA2C614E    M2
 4: 2021-08-15 2021-08-15 14:11:15    DA2C614E    M2
 5: 2021-08-15 2021-08-15 14:11:15    DA2C614E    M2
 6: 2021-08-15 2021-08-15 14:11:16    DA2C614E    M2
 7: 2021-08-15 2021-08-15 14:12:40    DA2C614E   HM2
 8: 2021-08-15 2021-08-15 14:12:40    DA2C614E   HM2
 9: 2021-08-15 2021-08-15 14:12:49    DA2C614E    H2
10: 2021-08-15 2021-08-15 14:18:02    DA2C614E    H1
11: 2021-08-15 2021-08-15 14:18:04    DA2C614E    H1
12: 2021-08-15 2021-08-15 14:19:19    DA2C614E    H1
13: 2021-08-15 2021-08-15 14:25:29    DA2C614E   HM2

The Transponder identifies the individual and Units its position. I would like to create a sparser version of this data set by deleting duplicated rows, but only those that occur in succession, since successive duplicates simply mean no movement. The end result would look like this:

      Date       TimeStamp             Transponder Units
   1: 2021-08-15 2021-08-15 14:11:13    DA2C614E    M2
   2: 2021-08-15 2021-08-15 14:12:40    DA2C614E   HM2
   3: 2021-08-15 2021-08-15 14:12:49    DA2C614E    H2
   4: 2021-08-15 2021-08-15 14:18:02    DA2C614E    H1
   5: 2021-08-15 2021-08-15 14:25:29    DA2C614E   HM2

I have tried working with duplicated(), but the problem is that this function removes all duplicates, including, for example, the later repeated HM2 entry (I ran duplicated on a subset of Date, Transponder and Units):

> sample[!duplicated(sample[, c(1,3,4)]),]
         Date           TimeStamp Transponder Units
1: 2021-08-15 2021-08-15 14:11:13    DA2C614E    M2
2: 2021-08-15 2021-08-15 14:12:40    DA2C614E   HM2
3: 2021-08-15 2021-08-15 14:12:49    DA2C614E    H2
4: 2021-08-15 2021-08-15 14:18:02    DA2C614E    H1

Any ideas on how to solve this "elegantly", i.e. without having to loop through this?

Gretchen

3 Answers


Using rleid from data.table we create a dummy grouping variable, and with distinct from dplyr we remove the duplicates. With your data you may want to include Transponder in the rleid call as well, if it varies in your real data (see the sketch further below).

library(tidyverse)
library(data.table)

df %>% 
  # rleid() increments every time the value of Units changes,
  # so each run of identical Units gets its own id
  mutate(dummy = rleid(Units)) %>% 
  # keep only the first row of each run
  distinct(dummy, .keep_all = TRUE) %>% 
  select(-dummy)

        Date           TimeStamp Transponder Units
1 2021-08-15 2021-08-15 14:11:13    DA2C614E    M2
2 2021-08-15 2021-08-15 14:12:40    DA2C614E   HM2
3 2021-08-15 2021-08-15 14:12:49    DA2C614E    H2
4 2021-08-15 2021-08-15 14:18:02    DA2C614E    H1
5 2021-08-15 2021-08-15 14:25:29    DA2C614E   HM2

Based on the comments: using just data.table and no temporary variable, you could do the following: `dt[!duplicated(rleid(Units)), ]`.
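
If multiple transponders can appear interleaved in the real data, the same idea extends by passing both columns to rleid, so that a new run starts whenever either value changes. A minimal sketch, assuming dt holds the full table:

library(data.table)

# a new run starts whenever Transponder OR Units changes
dt[!duplicated(rleid(Transponder, Units))]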

Lennyy
  • Thanks! rleid works perfectly! What's the advantage of using distinct over duplicated on the grouping dummy here? They both deliver the same results as far as I can see? – Gretchen Mar 23 '22 at 10:03
  • Great it worked. I think that is just my habit of using the tidyverse framework. `rleid` is the key function in this solution, and you could use it here in conjunction with `duplicated` or `unique` as well. – Lennyy Mar 23 '22 at 10:08
  • `data.table`-variant: `df <- df[unique(rleid(Units)),]`; there is no need for temp variables or deselection (or `dplyr` even). – r2evans Mar 23 '22 at 10:15
  • `unique` only gives you the first 5 records: `unique` returns just the vector 1:5 (we have 5 groups), so you end up subsetting the first 5 rows of your table. You need `!duplicated` to get a TRUE/FALSE vector that marks which positions to keep. – Merijn van Tilborg Mar 23 '22 at 10:45
  • Or we just take the first row of each rleid group. It changes your order a bit and adds the rleid column, but I think on large datasets it can be more efficient: `dt[, .SD[1L], by = .(rleid(Units), Units)]` – Merijn van Tilborg Mar 23 '22 at 11:10
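
To make the unique-versus-!duplicated distinction from these comments concrete, a small sketch on made-up values:

library(data.table)

ids <- rleid(c("M2", "M2", "HM2", "H2", "HM2"))
ids               # 1 1 2 3 4
unique(ids)       # 1 2 3 4: used as i, this takes row positions 1 to 4, the wrong rows
!duplicated(ids)  # TRUE FALSE TRUE TRUE TRUE: flags the first row of each run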

See Filtering out duplicated/non-unique rows in data.table.

I think that if your data is time-indexed, this will work:

unique(sample, by = "Date")

Example:

dt <- data.table(V1 = LETTERS[c(1,1,1,2,2,2)], V2 = c(1,1,1,2,2,2))

   V1 V2
1:  A  1
2:  A  1
3:  A  1
4:  B  2
5:  B  2
6:  B  2
unique(dt, by = "V1")

   V1 V2
1:  A  1
2:  B  2

VYago
  • Thanks, but that won't work, as it will also remove valid repeated entries. – Gretchen Mar 23 '22 at 10:07
  • If you change your example to V2 = c(1,2,1,2,2,2), you will see that the output is the same as for V2 = c(1,1,1,2,2,2), even though I would want the output to show the 1,2,1 sequence, as the ones are not in succession. – Gretchen Mar 23 '22 at 10:18
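
A quick sketch of the failure case Gretchen describes, together with the rleid-based fix from the accepted answer for comparison:

library(data.table)

dt <- data.table(V1 = LETTERS[c(1,1,1,2,2,2)], V2 = c(1,2,1,2,2,2))

unique(dt, by = "V1")  # same output as before; the 1,2,1 sequence in V2 is lost
#    V1 V2
# 1:  A  1
# 2:  B  2

dt[!duplicated(rleid(V1, V2))]  # only successive duplicates collapse
#    V1 V2
# 1:  A  1
# 2:  A  2
# 3:  A  1
# 4:  B  2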

I wanted to have a go at this using data.table, as you stated you are using it, presumably because your data is large; this should therefore be faster than the suggested dplyr approach. I do not have much experience with data.table, but the following seems to work:

# Cols to subset
cols <- c("Date", "Transponder", "Units")
lagcols <- paste0(cols, "_lag")

# Create lag
sample[, (lagcols) := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = cols]

# Create boolean mask: TRUE if the row equals the previous row in the selected columns
sample[, equals_previous := Date == Date_lag & Transponder == Transponder_lag & Units == Units_lag]

# Delete lag columns
sample[, (lagcols) := NULL]

# Keep rows that differ from the previous row. The first row's comparison is NA
# (there is no previous row), and data.table treats NA in a logical i as FALSE,
# so it has to be kept explicitly
sample[is.na(equals_previous) | !equals_previous]

There may be more elegant ways to do this in data.table (one compact variant is sketched below), but if your data is of significant size this should at least be faster than converting to a data.frame and using dplyr.
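
For what it's worth, a sketch of one such more compact variant of the same shift-based idea, without materialising the lag columns (untested against the real data; the column names are taken from the question):

library(data.table)

cols <- c("Date", "Transponder", "Units")

# TRUE where at least one key column differs from its lagged self;
# row 1 compares against NA, so it is kept via is.na()
changed <- sample[, Reduce(`|`, Map(function(x) x != shift(x), .SD)), .SDcols = cols]
sample[is.na(changed) | changed]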

SamR