Suppose my data is as follows,

X    Y
26  14
26  14
26  15
26  15
27  15
27  15
28  16
28  16

I want to remove consecutive duplicate rows. I am able to remove duplicates based on a single column with this command,

dat[c(T, diff(dat$X) != 0), ] or dat[c(T, diff(dat$Y) != 0), ]

But I want to remove a row only when both columns have the same values as in the previous row. I can't use unique here because the same values may occur again later in the data. I want to compare each row only against the row immediately before it.

My sample output is,

X    Y
26  14
26  15
27  15
28  16

How can we do this in R?

Thanks

Ijaz

Observer
  • Perhaps `dat[Reduce("|",lapply(dat, function(x) c(T, diff(x)!=0))),]` – akrun Jul 29 '15 at 22:04
  • There already is a `duplicated` method for dataframes. `dat[!duplicated(dat),]` – IRTFM Jul 29 '15 at 22:05
  • Can you try the `Reduce` option? You mentioned that the other methods are not working. It may be better to provide an example that does not work with the methods posted. – akrun Jul 29 '15 at 22:09
  • @BondedDust This works for all the columns right? I have 7 columns. But I want to remove duplicates using only two columns. – Observer Jul 29 '15 at 22:10
  • You've chosen a bad example, IIUC. – Arun Jul 29 '15 at 22:10
  • @Observer You can subset the columns to check the duplicates `!duplicated(dat[1:2])` – akrun Jul 29 '15 at 22:11
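
To make the difference between the two suggestions in these comments concrete, here is a small base R sketch. It assumes a hypothetical data frame `dat2` (not from the original post) that extends the question's data with a later repeat of `26 14` and an extra column `Z` that should be ignored in the comparison. It also assumes numeric columns, since `diff()` is used.

```r
# Hypothetical data: the question's rows plus a later repeat of (26, 14),
# and an extra column Z that is not part of the comparison.
dat2 <- data.frame(
  X = c(26, 26, 26, 26, 27, 27, 28, 28, 26),
  Y = c(14, 14, 15, 15, 15, 15, 16, 16, 14),
  Z = 1:9
)

# Global de-duplication on X and Y: also drops the later (26, 14) row.
global <- dat2[!duplicated(dat2[c("X", "Y")]), ]

# Consecutive de-duplication: keep a row if X or Y differs from the previous row.
changed <- Reduce(`|`, lapply(dat2[c("X", "Y")], function(x) c(TRUE, diff(x) != 0)))
consecutive <- dat2[changed, ]

global$Z       # 1 3 5 7
consecutive$Z  # 1 3 5 7 9
```

With data like the question's original example, where no pair recurs later, both approaches return the same rows, which is presumably why the example was called a bad one.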

2 Answers

Using data.table v1.9.5 - installation instructions here:

library(data.table) # v1.9.5+
# df is the question's data frame (called dat there)
df[!duplicated(rleidv(df, cols = c("X", "Y"))), ]

rleidv() is best understood with examples:

rleidv(c(1,1,1,2,2,3,1,1))
# [1] 1 1 1 2 2 3 4 4

A unique index is generated for each consecutive run of values.

And the same can be accomplished on a list() or data.frame() or data.table() on a specific set of columns as well. For example:

df = data.frame(a = c(1,1,2,2,1), b = c(2,3,4,4,2))
rleidv(df) # computes on both columns 'a,b'
# [1] 1 2 3 3 4
rleidv(df, cols = "a") # only looks at 'a'
# [1] 1 1 2 2 3

The rest follows directly: duplicated() on the run index flags every row after the first in each consecutive run, so negating it keeps only the first row of each run.
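
If data.table is not available, the same run-index idea can be sketched in base R. The helper `run_id()` below is hypothetical (it is not part of data.table or base R) and assumes the compared columns contain no `NA` values; `ex` simply repeats the small example from above.

```r
# Hypothetical helper: a base R stand-in for rleidv() on selected columns.
run_id <- function(d, cols = names(d)) {
  n <- nrow(d)
  if (n == 0) return(integer(0))
  # TRUE where any selected column differs from the previous row
  changed <- Reduce(`|`, lapply(d[cols], function(x) c(TRUE, x[-1] != x[-n])))
  # Each change starts a new run, so the cumulative sum is the run index
  cumsum(changed)
}

ex <- data.frame(a = c(1, 1, 2, 2, 1), b = c(2, 3, 4, 4, 2))
run_id(ex)               # 1 2 3 3 4
run_id(ex, cols = "a")   # 1 1 2 2 3

# Keep the first row of each consecutive run
ex[!duplicated(run_id(ex)), ]
```

This is only meant to illustrate what the run index represents; for real data the data.table function is likely the better choice.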

Arun

Using dplyr (here z is the question's data frame):

library(dplyr)
z %>% filter(X != lag(X) | Y != lag(Y) | row_number() == 1)

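We need to include row_number() == 1, or we lose the first row: lag() returns NA for the first row, so both comparisons evaluate to NA there, and filter() drops rows where the condition is NA.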
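
The role of the first-row condition can be checked without dplyr. The sketch below rebuilds the question's data and uses a hypothetical helper `prev()` as a base R stand-in for `lag()`; it is an illustration of the logic, not the dplyr implementation.

```r
# The question's data
dat <- data.frame(
  X = c(26, 26, 26, 26, 27, 27, 28, 28),
  Y = c(14, 14, 15, 15, 15, 15, 16, 16)
)

# Hypothetical stand-in for dplyr::lag(): shift down one position, pad with NA
prev <- function(x) c(NA, x[-length(x)])

# TRUE where X or Y differs from the previous row
differs <- dat$X != prev(dat$X) | dat$Y != prev(dat$Y)
differs[1]   # NA, because the first row has no previous row

# NA | TRUE is TRUE, so the first row is kept explicitly
keep <- differs | seq_len(nrow(dat)) == 1
dat[keep, ]
```

The result has the four rows listed in the question's expected output.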

jeremycg