1

I want to keep only non-duplicated rows in a dataset. This is going one step beyond "removing duplicates"; that is, I want to eliminate ALL copies of duplicated rows, not just the duplicate copies, and keep only the rows that were never duplicated in the first place.

Dataset:

    df <- data.frame(A = c(5,5,6,7,8,8,8), B = sample(1:100, 7))
    df
    A  B
    5 91
    5 46
    6 41
    7 98
    8 35
    8 56
    8 36

Want to turn it into:

    A  B
    6 41
    7 98

Here is what I tried using dplyr:

    df_single <- df %>% count(A) %>% filter(n == 1)
    # Returns all the values of A for which only one row exists

    df %>% filter(A == df_single$A)
    # Trying to subset only those values of A, but this returns error 
    # "longer object length is not a multiple of shorter object length"

Thanks for your help. A nice bonus would be additional code for doing the opposite (keeping all the OTHER rows - i.e., eliminating only the non-duplicated rows from the dataset).

Pierre L
  • 28,203
  • 6
  • 47
  • 69
gnotnek
  • 309
  • 4
  • 14

1 Answers1

3

Try this (no packages needed):

subset(df, !duplicated(A) & !duplicated(A, fromLast = TRUE))

giving:

  A  B
3 6 41
4 7 98
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you! That's a clever solution. And apparently I need to improve my Googling skills, as I couldn't find the duplicate question/answer. – gnotnek Nov 18 '15 at 18:23