deleting redundant records that are not perfect duplicates

Question

I need to remove redundant records from a file, but these redundant records don't look like standard duplicates. The object have is a data frame with the number of school projects the characters of the TV show Recess have worked on together. There are 7,000 observations.

head(have)

obs authA           authB            n_projects
1   TJ.DETWEILER    GRETCHEN.WILSON          11
2   TJ.DETWEILER    KING.BOB                  2
3   TJ.DETWEILER    ASHLEY.SPINELLI           1
4   TJ.DETWEILER    VINCE.LASALLE             3
5   GRETCHEN.WILSON TJ.DETWEILER             11
6   GRETCHEN.WILSON ASHLEY.SPINELLI           7
…   …               …                         …

There is one redundant record shown: the 1st observation contains the same information as the 5th observation. The author order (i.e., who is listed as authA or authB) doesn't matter. I need to remove one of these observations - it doesn't matter which. The new data frame want could look like this:

obs authA           authB            n_projects
1   TJ.DETWEILER    GRETCHEN.WILSON          11
2   TJ.DETWEILER    KING.BOB                  2
3   TJ.DETWEILER    ASHLEY.SPINELLI           1
4   TJ.DETWEILER    VINCE.LASALLE             3
6   GRETCHEN.WILSON ASHLEY.SPINELLI           7
…   …               …                         …

though removing the first obs would also be fine.

akrun · Accepted Answer · 2019-05-01T14:23:46.477

Subset the dataset columns ('authA', 'authB'), loop over the rows, sort and then apply duplicated to create a logical vector and use that to remove the duplicate rows

have[!duplicated(t(apply(have[2:3], 1, sort))),]
#  obs           authA           authB n_projects
#1   1    TJ.DETWEILER GRETCHEN.WILSON         11
#2   2    TJ.DETWEILER        KING.BOB          2
#3   3    TJ.DETWEILER ASHLEY.SPINELLI          1
#4   4    TJ.DETWEILER   VINCE.LASALLE          3
#6   6 GRETCHEN.WILSON ASHLEY.SPINELLI          7

Or an option with pmin/pmax

library(dplyr)
library(stringr)
have %>% 
   filter(!duplicated(str_c(pmin(authA, authB), pmax(authA, authB))))

data

have <- structure(list(obs = 1:6, authA = c("TJ.DETWEILER", "TJ.DETWEILER", 
"TJ.DETWEILER", "TJ.DETWEILER", "GRETCHEN.WILSON", "GRETCHEN.WILSON"
), authB = c("GRETCHEN.WILSON", "KING.BOB", "ASHLEY.SPINELLI", 
"VINCE.LASALLE", "TJ.DETWEILER", "ASHLEY.SPINELLI"), n_projects = c(11L, 
2L, 1L, 3L, 11L, 7L)), class = "data.frame", row.names = c(NA, 
-6L))

deleting redundant records that are not perfect duplicates

1 Answers1

data