Remove/collapse consecutive duplicate values in sequence

Question

I have the following dataframe:

a a a b c c d e a a b b b e e d d

The required result should be

a b c d e a b e d

It means no two consecutive rows should have same value. How it can be done without using loop.

As my data set is quite huge, looping is taking lot of time to execute.

The dataframe structure is like the following

a 1 
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10

Result:

a 1 
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4

Its should delete the entire row.

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2015-11-24T18:40:52.433

One easy way is to use rle:

Here's your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Update: For a `data.frame`

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1]  1  4  5  7  8  9 11 13 15
mydf[Y, ]
#    V1 V2
# 1   a  1
# 4   b  2
# 5   c  4
# 7   d  3
# 8   e  9
# 9   a  4
# 11  b 10
# 13  e  2
# 15  d  4

Update 2

The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
#    rleid V2
# 1:     1  1
# 2:     2  2
# 3:     3  4
# 4:     4  3
# 5:     5  9
# 6:     6  4
# 7:     7 10
# 8:     8  2
# 9:     9  4

How can i use this in the data.frame? If i would use this in list, then again i have to map this unique values with the old data.frame but then its not possible to map as the length is less. What my purpose is to remove the entire row of the dataframe whenever i would get the same value in consecutive rows for a particular column. — Amarjeet, Dec 15 '14 at 12:07
Nice answer! For you first ``data.frame`` solution, I found I needed ``X <- rle(as.numeric(mydf$V1))``, as ``V1`` is a factor. Other remark: I found that in some cases ``cumsum(X$lengths)`` would do the job, depending on which duplicate rows you wish to keep (top to bottom v. bottom to top), do you confirm? — PatrickT, Jun 21 '16 at 08:20

Khashaa · Answer 2 · 2014-12-15T14:47:02.993

library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x!=lag(x, default=1)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

EDIT: For data.frame

  mydf <- data.frame(
    V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
    V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10),
   stringsAsFactors=FALSE)

dplyr solution is one liner:

mydf %>% filter(V1!= lag(V1, default="1"))
#  V1 V2
#1  a  1
#2  b  2
#3  c  4
#4  d  3
#5  e  9
#6  a  4
#7  b 10
#8  e  2
#9  d  4

post scriptum

lead(x,1) suggested by @Carl Witthoft iterates in reverse order.

leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]

#   V1  V2
#3   a   3
#4   b   2
#6   c   1
#7   d   3
#8   e   9
#10  a   8
#12  b 199
#14  e   5
#16  d  10

Could it be that you're using `dplyr::lag` here? I tried your code with a clean session and it doenst work with `stats::lag` — talat, Dec 15 '14 at 11:31

Colonel Beauvel · Answer 3 · 2014-12-15T11:42:44.817

6

With base R, I like funny algorithmics:

x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")

x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

edited Dec 15 '14 at 11:42

answered Dec 15 '14 at 11:26

Colonel Beauvel

30,423
11
47
87

2

Similarly could use indexing instead of `tail`, something like `x[x != c(x[-1], FALSE)]` – David Arenburg Dec 15 '14 at 11:39
I struggled to adapt this to factors inside a ``data.frame``. Within a dataframe, the rhs, ``FALSE`` constrains the vector to be of type ``integer``, so on the lhs ``x`` must be so constrained for the comparison to be possible, with ``as.integer()``. Correct me if I'm wrong! – PatrickT Jun 21 '16 at 08:32
sorry but it's very unclear ... you can ask a question on SO if you are meeting an issue on such a topic, with your input, output and what you have done so far. – Colonel Beauvel Jun 21 '16 at 08:35

Carl Witthoft · Answer 4 · 2014-12-15T11:44:28.490

3

Much as I like,... errr, love rle , here's a shootoff:

EDIT: Can't figure out exactly what's up with dplyr so I used dplyr::lead . I'm on OSX, R3.1.2, and latest dplyr from CRAN.

xlet<-sample(letters,1e5,rep=T)
rleit<-function(x) rle(x)$values
lagit<-function(x) x[x!=lead(x, default=1)]
tailit<-function(x) x[x!=c(tail(x,-1), tail(x,1))]



  microbenchmark(rleit(xlet),lagit(xlet),tailit(xlet),times=20)
Unit: milliseconds
         expr      min       lq   median       uq      max neval
  rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657    20
  lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940    20
 tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840    20

edited Dec 15 '14 at 11:44

answered Dec 15 '14 at 11:27

Carl Witthoft

20,573
9
43
73

Did `lagit` work for you? It returns nothing to me. Maybe this is why its the fastest.... – David Arenburg Dec 15 '14 at 11:32
I think you need to re run this after loading `dplyr`, as I suspect `lagit` isnt doing anything currently and thus very fast – David Arenburg Dec 15 '14 at 11:40
Oops, yep-- I didn't check the outputs. But what's up? I see "lag" listed under "leadlag" but there's no actual function `dplyr::lag` – Carl Witthoft Dec 15 '14 at 11:42
1

The `tailit` function in your benchmark is missing the last "d" in the vector.. perhaps update to `x[x!=c(x[-1], FALSE)]` as in the answer – talat Dec 15 '14 at 11:54
Seems like `lead` is little slower than `lag`. Why is that so? https://github.com/hadley/dplyr/blob/master/R/lead-lag.R – Khashaa Dec 15 '14 at 13:43
@DavidArenburg Upon reflection, `lagit` returned zero because I had a wide range of values so there weren't any repeats. If you rerun my case with `xlet<-sample(letters[1:7],1e5,rep=T)` you'll get some hits. – Carl Witthoft Dec 15 '14 at 14:41
Is there any **LAG** function equivalent in R ?? (http://stackoverflow.com/questions/20636929/retain-and-lag-function-in-r-as-sas) I have tried this but did n't work. – Amarjeet Dec 15 '14 at 20:38

score 0 · Answer 5 · answered Jan 22 '23 at 22:04

Tidyverse solution:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x)
x |> 
 mutate(id = consecutive_id(x)) |> 
 distinct(x, id)

In addition, if there is another column y associated with the consecutive values column, this solution allows some flexibility:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x, y = runif(length(x)))
x |> 
    group_by(id = consecutive_id(x)) |> 
    slice_min(y)

We can choose between the different slice functions, like slice_max, slice_min, slice_head, and slice_tail.

This Stack Overflow thread appeared in the second edition of R4DS, in the Numbers chapter of the book.

Remove/collapse consecutive duplicate values in sequence

5 Answers5

Update: For a `data.frame`

Update 2

Linked

Related

Remove/collapse consecutive duplicate values in sequence

5 Answers5

Update: For a data.frame

Update 2

Linked

Related

Update: For a `data.frame`