Exclude subsequent duplicated rows

Question

I would like to remove duplicated rows, but only when the duplicates are consecutive, that is, when a row has the same VALUE as the row immediately before it. A representative example follows:

My input df:

df <- "NAME   VALUE
Prb1  0.05
Prb2  0.05
Prb3  0.05
Prb4  0.06
Prb5  0.06
Prb6  0.01
Prb7  0.10
Prb8  0.05"

df <- read.table(text = df, header = TRUE)

My expected outdf:

outdf <- "NAME   VALUE 
Prb1  0.05
Prb4  0.06
Prb6  0.01
Prb7  0.10
Prb8  0.05"

outdf <- read.table(text = outdf, header = TRUE)

Josh O'Brien · Accepted Answer · 2015-05-15T13:39:36.353

14

rle() is a nice function that identifies runs of identical values, but it can be kind of a pain to wrestle its output into a usable form. Here's a relatively painless incantation that works in your case:

df[sequence(rle(df$VALUE)$lengths) == 1, ]
#   NAME VALUE
# 1 Prb1  0.05
# 4 Prb4  0.06
# 6 Prb6  0.01
# 7 Prb7  0.10
# 8 Prb8  0.05
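To see why this works, it may help to look at the intermediate values. The following is a minimal sketch that rebuilds the question's data; the names r, pos, and result are only illustrative.

```r
# Rebuild the example data from the question
df <- read.table(text = "NAME VALUE
Prb1 0.05
Prb2 0.05
Prb3 0.05
Prb4 0.06
Prb5 0.06
Prb6 0.01
Prb7 0.10
Prb8 0.05", header = TRUE)

r <- rle(df$VALUE)          # run lengths: 3 2 1 1 1
pos <- sequence(r$lengths)  # position within each run: 1 2 3 1 2 1 1 1
result <- df[pos == 1, ]    # keep only the first row of every run
```

Because sequence() restarts at 1 for every run, testing for 1 selects exactly the rows that open a new run.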

edited May 15 '15 at 13:39

answered May 15 '15 at 13:33


David Arenburg · Answer 2 · 2015-05-15T14:26:44.227

There are probably many ways of solving this. I would try an rleid/unique combination from the data.table devel version:

library(data.table) ## v >= 1.9.5
unique(setDT(df)[, indx := rleid(VALUE)], by = "indx")
#    NAME VALUE indx
# 1: Prb1  0.05    1
# 2: Prb4  0.06    2
# 3: Prb6  0.01    3
# 4: Prb7  0.10    4
# 5: Prb8  0.05    5

Or from some great suggestions from comments:

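Using just the new shift function (note that fill = TRUE is, as far as I can tell, coerced to 1 for a numeric column, so this relies on the first VALUE not being 1):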

setDT(df)[VALUE != shift(VALUE, fill = TRUE)]

Or using duplicated combined with rleid

setDT(df)[!duplicated(rleid(VALUE)), ]
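For readers without data.table, a run id comparable to what rleid() returns can be sketched in base R. This is only an illustration of the idea, not data.table's implementation; x, r, run_id, and keep are names made up for the example.

```r
# VALUE column from the question
x <- c(0.05, 0.05, 0.05, 0.06, 0.06, 0.01, 0.10, 0.05)

r <- rle(x)
# Number the runs 1, 2, 3, ... and repeat each number for the length of its run
run_id <- rep(seq_along(r$lengths), r$lengths)   # 1 1 1 2 2 3 4 5

# The first occurrence of each run id marks the row to keep
keep <- !duplicated(run_id)
```

Note that the last 0.05 gets its own run id (5), which is why it survives even though the same value appeared earlier.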

NPE · Answer 3 · 2015-05-15T13:39:28.830

8

How about this:

> df[c(T, df[-nrow(df),-1] != df[-1,-1]), ]
  NAME VALUE
1 Prb1  0.05
4 Prb4  0.06
6 Prb6  0.01
7 Prb7  0.10
8 Prb8  0.05

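Here, df[-nrow(df),-1] != df[-1,-1] compares the VALUE of each row with that of the next row, giving TRUE wherever a row differs from its predecessor. Prepending T always keeps the first row, and the resulting logical vector is used to subset the data frame.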
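Spelled out on the VALUE column alone, the comparison might look like the sketch below. The intermediate names (prev, nxt, changed, keep) are illustrative and not part of the answer.

```r
df <- data.frame(
  NAME  = paste0("Prb", 1:8),
  VALUE = c(0.05, 0.05, 0.05, 0.06, 0.06, 0.01, 0.10, 0.05)
)

prev <- df$VALUE[-nrow(df)]  # rows 1..7
nxt  <- df$VALUE[-1]         # rows 2..8
changed <- prev != nxt       # FALSE FALSE TRUE FALSE TRUE TRUE TRUE

# The first row has no predecessor, so it is always kept
keep <- c(TRUE, changed)
res <- df[keep, ]
```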

edited May 15 '15 at 13:39

answered May 15 '15 at 13:26


score 4 · Answer 4 · answered May 15 '15 at 20:46


I would use a solution similar to @NPE's, but comparing consecutive differences against a tolerance:

df[c(TRUE, abs(diff(df$VALUE)) > 1e-6), ]

Of course you can use any other tolerance level (other than 1e-6).
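A small sketch of why a tolerance can matter for numeric columns: values that print identically may still differ in floating point. The vector x below is made up for illustration and is not from the question.

```r
# 0.1 + 0.2 is not exactly equal to 0.3 in floating point
x <- c(0.1 + 0.2, 0.3, 0.3)

# Exact comparison treats the first two values as different
exact <- c(TRUE, x[-1] != x[-length(x)])   # TRUE TRUE FALSE

# Comparison with a tolerance treats them as one run
tol <- c(TRUE, abs(diff(x)) > 1e-6)        # TRUE FALSE FALSE
```

If the values were read from text, as in the question, equal-looking entries parse to the same number, so both approaches give the same result there.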


cryo111


score 2 · Answer 5 · answered May 15 '15 at 15:13

I came across this nice function a while ago; it flags each row that is the first of a run, based on one or more specified variables:

  isFirst <- function(x,...) {
      lengthX <- length(x)
      if (lengthX == 0) return(logical(0))
      retVal <- c(TRUE, x[-1]!=x[-lengthX])
      for(arg in list(...)) {
          stopifnot(lengthX == length(arg))
          retVal <- retVal | c(TRUE, arg[-1]!=arg[-lengthX])
      }
      if (any(missing<-is.na(retVal))) # match rle: NA!=NA
          retVal[missing] <- TRUE
      retVal
  }

Applying it to your data gives:

> df$first <- isFirst(df$VALUE)
> df
  NAME VALUE first
1 Prb1  0.05  TRUE
2 Prb2  0.05 FALSE
3 Prb3  0.05 FALSE
4 Prb4  0.06  TRUE
5 Prb5  0.06 FALSE
6 Prb6  0.01  TRUE
7 Prb7  0.10  TRUE
8 Prb8  0.05  TRUE

You can then keep only the rows where first is TRUE to get your expected output.

I've found this very useful in the past, especially coming from a SAS background where this was very easy to do.
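The final filtering step on the first column could look like the sketch below. To keep the example self-contained, the flag is rebuilt with a one-line stand-in for isFirst(df$VALUE) that assumes a single variable with no NAs; outdf is just an illustrative name.

```r
df <- data.frame(
  NAME  = paste0("Prb", 1:8),
  VALUE = c(0.05, 0.05, 0.05, 0.06, 0.06, 0.01, 0.10, 0.05)
)

# Stand-in for isFirst(df$VALUE): single variable, no NA handling
df$first <- c(TRUE, df$VALUE[-1] != df$VALUE[-nrow(df)])

# Keep the flagged rows and drop the helper column
outdf <- df[df$first, c("NAME", "VALUE")]
```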

zx8754 · Answer 6 · 2015-05-15T21:29:35.493

2

Many good answers already; here is a dplyr version:

library(dplyr)
filter(df, VALUE != lag(VALUE, default = df$VALUE[1] + 1))
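The default argument is what keeps the first row: lag() has no earlier value to compare the first entry with, so padding with the first VALUE plus 1 guarantees a mismatch there. A base R imitation of that step, with illustrative names and without dplyr, might look like this:

```r
VALUE <- c(0.05, 0.05, 0.05, 0.06, 0.06, 0.01, 0.10, 0.05)

# Imitates lag(VALUE, default = VALUE[1] + 1): shift by one, pad the front
lagged <- c(VALUE[1] + 1, head(VALUE, -1))

# TRUE wherever a value differs from the one before it (and for the first value)
keep <- VALUE != lagged
```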

edited May 15 '15 at 21:29

answered May 15 '15 at 21:20

