1

I have a simple dataframe like this:

| id1 | id2 | location   | comment   |
|-----|-----|------------|-----------|
| 1   | 2   | Alaska     | cold      |
| 2   | 1   | Alaska     | freezing! |
| 3   | 4   | California | nice      |
| 4   | 5   | Kansas     | boring    |
| 9   | 10  | Alaska     | cold      |

The first two rows are duplicates because id1 and id2 both went to Alaska. It doesn't matter that their comment are different.

How can I remove one of these duplicates -- either one would be fine to remove.

I was first trying to sort id1 and id2, then get the index where they are duplicated, then go back and use the index to subset the original df. But I can't seem to pull this off.

df <- data.frame(id1 = c(1,2,3,4,9), id2 = c(2,1,4,5,10), location=c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'), comment=c('cold', 'freezing!', 'nice', 'boring', 'cold'))
Monica Heddneck
  • 2,973
  • 10
  • 55
  • 89

1 Answers1

2

We can use apply with MARGIN=1 to sort by row for the 'id' columns, cbind with 'location' and then use duplicated to get a logical index that can be used for removing/keeping the rows.

df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]
#   id1 id2   location comment
#1   1   2     Alaska    cold
#3   3   4 California    nice
#4   4   5     Kansas  boring
#5   9  10     Alaska    cold
akrun
  • 874,273
  • 37
  • 540
  • 662
  • what do I do if my 'real' dataframe has 6 other columns that are similar to `location`: where it doesn't matter what the values are (but I'd like to keep them). Can I just pass them in like `c(df$location, df$col6, df$col7, df$col8)`? – Monica Heddneck Jul 29 '16 at 02:37
  • @MonicaHeddneck You can subset the dataset like `df[c("location", "col6", "col7", "col8")]` If you are using that also in the duplicated, `duplicated(cbind(t(apply(....), df[c("location",...)]))` – akrun Jul 29 '16 at 02:39
  • I'm confused...sorry! You solution didn't have `cbind`. And I'm not sure I get what you mean. I just would like to carry all the rest of the columns along with me. – Monica Heddneck Jul 29 '16 at 02:42
  • @MonicaHeddneck Instead of `cbind` i used `data.frame` – akrun Jul 29 '16 at 02:43
  • so do you think this will work? `df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), c(df$location, df$othercol1, df$othercol2))),]`? – Monica Heddneck Jul 29 '16 at 02:44
  • @MonicaHeddneck I would do `df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df[c("location", "othercol1", "othercol2")])),]` I think `c` wouldn't work because you are concatenating `vector`s. – akrun Jul 29 '16 at 02:47
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/118595/discussion-between-monica-heddneck-and-akrun). – Monica Heddneck Jul 29 '16 at 04:57
  • 1
    For me, this worked fine in preserving all columns in my df: `df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)))),]` – enileve Jun 17 '22 at 15:10