duplicates in multiple columns

Question

I have a data frame like so

> df
  a  b c    d
1 1  2 A 1001
2 2  4 B 1002
3 3  6 B 1002
4 4  8 C 1003
5 5 10 D 1004
6 6 12 D 1004
7 7 13 E 1005
8 8 14 E 1006

I want to remove the rows where there are repeated values in column c AND column d. So in this example rows 2, 3, 5 and 6 would be removed.

I have used this, which works:

df[!(df$c %in% df$c[duplicated(df$c)] & df$d %in% df$d[duplicated(df$d)]),]
>df
  a  b c    d
1 1  2 A 1001
4 4  8 C 1003
7 7 13 E 1005
8 8 14 E 1006

but it seems clunky and I can't help but think there is a better way. Any suggestions?

In case anyone wants to re-create the data frame, here is the code:

df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

See also http://stackoverflow.com/questions/13742446/duplicates-in-multiple-columns — Sam Firke, May 19 '16 at 17:50

score 35 · Accepted Answer · answered Dec 06 '12 at 11:26

It works if you use duplicated twice:

df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]

  a  b c    d
1 1  2 A 1001
4 4  8 C 1003
7 7 13 E 1005
8 8 14 E 1006
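
If you need this in more than one place, the same idea can be wrapped in a small function. This is only a minimal sketch: `drop_all_duplicates` is a hypothetical name chosen for illustration, not a base R function, and the data frame is the one from the question.

```r
# Sample data from the question
df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

# Hypothetical helper: drop every row whose combination of values
# in `cols` occurs more than once (no copy of a repeated row is kept).
drop_all_duplicates <- function(data, cols) {
  key <- data[cols]
  # Forward pass flags all but the first copy; backward pass flags all but the last.
  repeated <- duplicated(key) | duplicated(key, fromLast = TRUE)
  data[!repeated, , drop = FALSE]
}

result <- drop_all_duplicates(df, c("c", "d"))
print(result)  # rows 1, 4, 7 and 8 remain
```

Changing the columns that define a duplicate then only means changing the `cols` argument.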

Sven Hohenstein

Can someone explain to me why this works? I used it, and it does work, but I don't understand. Why does "NOT duplicated" OR "duplicated" de-duplicate the data frame? – jessi Sep 18 '14 at 17:38

@Jessi The formula is *NOT(duplicated OR duplicated)*. The first `duplicated` flags every repeated value except its first occurrence, and the second (with `fromLast = TRUE`) flags every repeated value except its last occurrence. Together, they flag *all* occurrences of repeated values. – Sven Hohenstein Sep 18 '14 at 18:08
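
To make that explanation concrete, here is a small sketch that prints both logical vectors side by side for the data in the question. The variable names (`key`, `first_pass`, `last_pass`, `either`) are chosen here for illustration only.

```r
# Sample data from the question
df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

key <- df[c("c", "d")]

# Scanning top to bottom: the first copy of each pair is NOT flagged.
first_pass <- duplicated(key)                  # TRUE for rows 3 and 6
# Scanning bottom to top: the last copy of each pair is NOT flagged.
last_pass <- duplicated(key, fromLast = TRUE)  # TRUE for rows 2 and 5
# A row flagged by either pass belongs to a repeated pair.
either <- first_pass | last_pass               # TRUE for rows 2, 3, 5 and 6

print(data.frame(key, first_pass, last_pass, either))
```

Negating `either` gives the rows to keep: 1, 4, 7 and 8.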

How would you keep the first duplicated record only? – mob May 26 '17 at 21:32

@mob Do you mean `df[!duplicated(df[c("c","d")]), ]`? – Sven Hohenstein May 26 '17 at 21:55

score 28 · Answer 2 · answered Jul 14 '16 at 15:18

Make a new object with the 2 columns:

df_dups <- df[c("c", "d")]

Now apply it to your main df:

df[!duplicated(df_dups),]

This looks neater and makes it easy to see or change the columns you are using.

Tunn

Why should someone allocate memory to `df_dups`? This is very bad practice, especially if you have reasonably large datasets. P.S. This is not what the OP actually asked about: you keep the first record, while the OP wants all the duplicates gone. – M-- Aug 15 '18 at 10:18

This is a nice answer to the title and so will probably help people if not OP because it doesn't flag the first occurrence of a repeat. If you're worried about memory you don't have to define df_dups separately. – D A Wells May 05 '21 at 11:09

Just put the df code inside the duplicated. Skip the extra step. – luchonacho Oct 01 '21 at 22:10
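
For reference, the inlined form suggested in these comments might look like the sketch below (`keep_first` is a name chosen here for illustration). Note that it keeps the first row of each repeated pair, so the result differs from what the question asks for.

```r
# Sample data from the question
df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

# Select the key columns inline instead of storing them in df_dups.
# Only the second and later copies of each (c, d) pair are dropped.
keep_first <- df[!duplicated(df[c("c", "d")]), ]
print(keep_first)  # rows 1, 2, 4, 5, 7 and 8 remain
```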
