36

I have a data frame like so:

> df
  a  b c    d
1 1  2 A 1001
2 2  4 B 1002
3 3  6 B 1002
4 4  8 C 1003
5 5 10 D 1004
6 6 12 D 1004
7 7 13 E 1005
8 8 14 E 1006

I want to remove the rows where there are repeated values in column c AND column d. So in this example rows 2, 3, 5 and 6 would be removed.

I have used this, which works:

df[!(df$c %in% df$c[duplicated(df$c)] & df$d %in% df$d[duplicated(df$d)]),]
>df
  a  b c    d
1 1  2 A 1001
4 4  8 C 1003
7 7 13 E 1005
8 8 14 E 1006

but it seems clunky and I can't help but think there is a better way. Any suggestions?

In case anyone wants to re-create the data frame, here is the code:

df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)
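
One caveat about the `%in%` filter shown earlier: it tests column c and column d independently, so it can also drop a row whose (c, d) pair is unique, as long as each value is repeated somewhere in its own column. The sketch below uses a made-up data frame (`df2` is hypothetical, not part of the question) to show the difference from a pair-wise test with `duplicated()`.

```r
# Hypothetical data: every (c, d) pair is unique, but "B", "D" and 1004
# each occur more than once within their own column.
df2 <- data.frame(
  c = c("A", "B", "B", "D", "D"),
  d = c(1001, 1002, 1004, 1004, 1005)
)

# Column-wise test, as in the question: flags rows 3 and 4.
by_column <- df2$c %in% df2$c[duplicated(df2$c)] &
  df2$d %in% df2$d[duplicated(df2$d)]

# Pair-wise test: flags nothing, because no (c, d) pair repeats.
by_pair <- duplicated(df2[c("c", "d")]) |
  duplicated(df2[c("c", "d")], fromLast = TRUE)

nrow(df2[!by_column, ])  # 3 rows survive
nrow(df2[!by_pair, ])    # all 5 rows survive
```

On the data in the question both tests give the same result, which is why the original filter appears to work there.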
moodymudskipper
Davy Kavanagh

2 Answers

35

It works if you use duplicated twice:

df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]

  a  b c    d
1 1  2 A 1001
4 4  8 C 1003
7 7 13 E 1005
8 8 14 E 1006
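
To see why two calls are needed, it may help to look at the two logical vectors separately. This is only an illustrative breakdown of the one-liner above; the names `key`, `later_copy`, `earlier_copy` and `repeated` are mine and not part of the original code.

```r
df <- data.frame(
  a = 1:8,
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

key <- df[c("c", "d")]

# Scanning top to bottom marks every repeat except the first occurrence.
later_copy <- duplicated(key)                     # TRUE for rows 3 and 6

# Scanning bottom to top marks every repeat except the last occurrence.
earlier_copy <- duplicated(key, fromLast = TRUE)  # TRUE for rows 2 and 5

# The union covers every row whose (c, d) pair occurs more than once.
repeated <- later_copy | earlier_copy

df[!repeated, ]  # keeps rows 1, 4, 7 and 8
```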
Sven Hohenstein
  • 3
    Can someone explain to me why this works? I used it, and it does work, but I don't understand. Why does "NOT duplicated" OR "duplicated" de dup the data frame? – jessi Sep 18 '14 at 17:38
  • 4
    @Jessi The formula is *NOT(duplicated OR duplicated)*. The first duplicated does not identify the first occurrences of duplicated values and the second duplicated does not identify the last occurrences of duplicated values. Together, they identify *all* duplicated values. – Sven Hohenstein Sep 18 '14 at 18:08
  • 2
    How would you keep the first duplicated record only? – mob May 26 '17 at 21:32
  • 4
    @mob Do you mean `df[!duplicated(df[c("c","d")]), ]`? – Sven Hohenstein May 26 '17 at 21:55
28

Make a new object with the 2 columns:

df_dups <- df[c("c", "d")]

Now apply it to your main df:

df[!duplicated(df_dups),]

This looks neater and makes it easy to see or change the columns you are using.
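
Note that `!duplicated(df_dups)` on its own keeps the first row of each repeated pair. As a rough sketch of the difference on the question's data (the names `first_kept` and `all_dropped` are just for illustration), the same `df_dups` object can also be used to drop every repeated row by adding a second pass with `fromLast = TRUE`:

```r
df <- data.frame(
  a = 1:8,
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

df_dups <- df[c("c", "d")]

# Keeps one row per (c, d) pair: the first occurrence survives.
first_kept <- df[!duplicated(df_dups), ]

# Drops every row whose (c, d) pair occurs more than once.
all_dropped <- df[!(duplicated(df_dups) |
                      duplicated(df_dups, fromLast = TRUE)), ]

first_kept$a   # 1 2 4 5 7 8
all_dropped$a  # 1 4 7 8
```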

Tunn
  • 4
    Why should someone allocate memory to `df_dups`? This is very bad practice especially if you have reasonably large size datasets. p.s. this is not what OP actually asked about. you keep the first record while OP wants all the duplicates gone. – M-- Aug 15 '18 at 10:18
  • 1
    This is a nice answer to the title and so will probably help people if not OP because it doesn't flag the first occurrence of a repeat. If you're worried about memory you don't have to define df_dups separately. – D A Wells May 05 '21 at 11:09
  • 1
    Just put the df code inside the duplicated. Skip the extra step. – luchonacho Oct 01 '21 at 22:10