Removing duplicates based on two columns in R

Question

Suppose my data is as follows,

I want to remove the rows of duplicates. I am able to remove the duplicate rows based on one column by this command,

dat[c(TRUE, diff(dat$X) != 0), ] or dat[c(TRUE, diff(dat$Y) != 0), ]

But I want to remove a row only when both columns have the same values as in the previous row. I can't use unique() here because the same values can legitimately occur again later in the data. I only want to compare each row with the one immediately before it.

My sample output is,

How can we do this in R?

Thanks

Ijaz

Perhaps `dat[Reduce("|",lapply(dat, function(x) c(T, diff(x)!=0))),]` — akrun, Jul 29 '15 at 22:04
There already is a `duplicated` method for dataframes. `dat[!duplicated(dat),]` — IRTFM, Jul 29 '15 at 22:05
Can you try the `Reduce` option? You mentioned that the other methods are not working. It may be better to provide an example that fails with the methods posted. — akrun, Jul 29 '15 at 22:09
@BondedDust This works for all the columns right? I have 7 columns. But I want to remove duplicates using only two columns. — Observer, Jul 29 '15 at 22:10
@Observer You can subset the columns to check the duplicates `!duplicated(dat[1:2])` — akrun, Jul 29 '15 at 22:11
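
The comments above mix two different notions of "duplicate": `duplicated()` removes every later repeat of an (X, Y) pair, while the question asks only about repeats of the immediately preceding row. The following is a minimal base R sketch of the difference. The question's data was not preserved, so the values in `dat` and the extra column `Z` are made up for illustration.

```r
# Hypothetical data: the question's original sample is not available.
dat <- data.frame(
  X = c(1, 1, 2, 2, 1, 1),
  Y = c(5, 5, 5, 6, 5, 5),
  Z = letters[1:6],
  stringsAsFactors = FALSE
)

# Global de-duplication on X and Y: drops every later repeat of a pair,
# including the (1, 5) pair that legitimately reappears in row 5.
global <- dat[!duplicated(dat[c("X", "Y")]), ]

# Consecutive de-duplication: keep a row when X or Y differs from the row above.
# The leading TRUE always keeps the first row.
changed <- c(TRUE, diff(dat$X) != 0 | diff(dat$Y) != 0)
consecutive <- dat[changed, ]

print(global$Z)       # rows a, c, d
print(consecutive$Z)  # rows a, c, d, e
```

Row `e` survives only in the second result, which is the behaviour the question describes. Note that `diff()` assumes numeric columns without missing values.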

Arun · Accepted Answer · 2015-07-29T22:21:49.463

Using data.table v1.9.5+:

require(data.table) # v1.9.5+
df[!duplicated(rleidv(df, cols = c("X", "Y"))), ]

rleidv() is best understood with examples:

rleidv(c(1,1,1,2,2,3,1,1))
# [1] 1 1 1 2 2 3 4 4

A unique index is generated for each consecutive run of values.

The same works on a list(), data.frame() or data.table(), optionally restricted to a specific set of columns. For example:

df = data.frame(a = c(1,1,2,2,1), b = c(2,3,4,4,2))
rleidv(df) # computes on both columns 'a,b'
# [1] 1 2 3 3 4
rleidv(df, cols = "a") # only looks at 'a'
# [1] 1 1 2 2 3

The rest follows directly: duplicated() flags every row after the first within each run id, so negating it keeps only the first row of each consecutive run.
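
For readers without data.table, the run-id idea can be approximated in base R. This is only a sketch of the concept, not data.table's implementation: `run_id()` is a hypothetical helper, and it assumes the compared columns contain no missing values.

```r
# run_id() is a hypothetical helper written for illustration; it is not part
# of data.table. Assumes the selected columns contain no NA values.
run_id <- function(df, cols = names(df)) {
  n <- nrow(df)
  if (n == 0) return(integer(0))
  # For each column, TRUE where a value differs from the one in the row above;
  # combine columns with OR so a change in any column starts a new run.
  changed <- Reduce(`|`, lapply(df[cols], function(x) x[-1] != x[-n]))
  # The first row always starts a run; each change increments the id.
  cumsum(c(TRUE, changed))
}

df <- data.frame(a = c(1, 1, 2, 2, 1), b = c(2, 3, 4, 4, 2))

ids <- run_id(df)
print(ids)                     # 1 2 3 3 4
print(run_id(df, cols = "a"))  # 1 1 2 2 3

# Keep the first row of each consecutive run.
print(df[!duplicated(ids), ])
```

The ids agree with the `rleidv()` output shown above for the same `df`, and the final line drops only the fourth row, which repeats the third.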

score 0 · Answer 2 · answered Jul 29 '15 at 22:05

Using dplyr:

library(dplyr)
z %>% filter(X != lag(X) | Y != lag(Y) | row_number() == 1)

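We need to include row_number() == 1, or we lose the first row: lag() returns NA for the first row, so both comparisons evaluate to NA there, and filter() drops rows where the condition is NA.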
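
Why the first row would otherwise be lost can be seen without dplyr. In this base R sketch, `prev()` is an illustrative stand-in for `lag()`, `which()` mimics the way `filter()` discards NA conditions, and the data in `z` is made up.

```r
# Hypothetical data; prev() is an illustrative stand-in for dplyr::lag().
z <- data.frame(X = c(1, 1, 2, 2), Y = c(3, 3, 3, 4))
prev <- function(x) c(NA, x[-length(x)])

# Row 1 is compared against NA, so the condition is NA there.
cond <- z$X != prev(z$X) | z$Y != prev(z$Y)
print(cond)  # NA FALSE TRUE TRUE

# which() ignores NA, mirroring how dplyr::filter() drops rows whose
# condition is NA: the first row disappears.
without_guard <- z[which(cond), ]

# Adding an explicit first-row test fixes it, because NA | TRUE is TRUE.
with_guard <- z[which(cond | seq_len(nrow(z)) == 1), ]

print(without_guard)
print(with_guard)
```

`without_guard` contains rows 3 and 4 only, while `with_guard` also keeps row 1.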

jeremycg
