Removing duplicate all-way-combinations while retaining all columns

Question

I need to remove duplicate combinations of two columns (feedID and feedID2) within groups (ID), while keeping a large number of other columns in the data set. All the rows with duplicates should be removed, whether it is A in column 2 and B in column 3 or vice versa. Additionally, I would like to remove all rows where there is for example A in both columns, or where there is an NA in one of the columns. I can not sort the data between columns, i.e. if A is in column nr 2, it should remain in column nr 2.

I know this might come across as a duplicate question, but none of the other answers seem to work with my data set, or asks for the same thing. E.g. Finding unique combinations irrespective of position Removing duplicate combinations in R (irrespective of order)

 test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                      feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                    feedID2 = c("A1", "G2", "A1", "G2", "NA", "D1", "D2",  "NA" ))

 desiredoutput <- data.frame(ID= c("49V", "52V", "52V"),
                      feedID = c("A1","B1", "D1" ),
                    feedID2 = c("G2", "D1", "D2" ))

the following code does not remove duplicates if in different columns

   test2 <- test [!duplicated(test[,c("ID","feedID", "feedID2")]),]

this code does not do anything at all but throws no error

  test2 <-  test%>% distinct(1,2,3) # where numbers refer to the columns

this code produces an error which for dimnames, not sure what that means. I do not get this with my test data, I am not sure why and cannot reproduce the error...

  indx <- !duplicated(t(apply(test, 1, sort))) # finds non - duplicates in sorted rows
   test[indx, ]

Any ideas?

I don't understand what you want to do with the `ID` column. — Andrew Brēza, Jul 18 '17 at 12:51
Unique only keeps the selected columns, I need to keep them all (47!). @cephalopod — Lisarv, Jul 18 '17 at 12:55
For each ID, I need to remove the duplicates. So if both ID 1 and 2 had the same combination of feedIDs, that would not count as a duplicate. Only within each ID can I remove duplicates. @AndrewBrēza — Lisarv, Jul 18 '17 at 12:55
@Lisarv Ok, `nest` then use `unique` and `unnest` should do the job.....an example of `nest` https://stackoverflow.com/questions/44363535/how-to-create-a-list-of-list-and-then-perform-a-vectorised-function-over-it — cephalopod, Jul 18 '17 at 12:59

CPak · Answer 1 · 2017-07-18T13:57:54.537

1

Your data again, but with "NA" changed to NA and stringsAsFactors=F

test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2",  NA ),
                   stringsAsFactors=F)

 library(dplyr)
 test %>% 
  filter(complete.cases(.)) %>%             # Remove rows with NA
  rowwise() %>%                             # Perform next step by row
  mutate(dup=paste0(sort(c(feedID,feedID2)),collapse="")) %>%   # Sort and combine feedID and feedID2
  ungroup() %>%
  group_by(ID) %>%                             # Remove rowwise grouping
  mutate(dup=duplicated(dup)) %>%           # Find duplicated feedID:feedID2 pairs
  filter(dup==F) %>%                        # Remove duplicated pairs
  filter(!(feedID==feedID2)) %>%            # Remove where feedID == feedID2
  select(-dup)                              # Remove dummy column


     ID feedID feedID2
1   49V     A1      G2
2   52V     B1      D1
3   52V     D1      D2

If you only want to look for NA in feedID & feedID2

replace filter(complete.cases(.)) with filter(!is.na(feedID) & !is.na(feedID2))

edited Jul 18 '17 at 13:57

answered Jul 18 '17 at 12:53

CPak

13,260
3
30
48

It works, if I ignore the fact that there are NAs n other columns in the dataset as well... But there is. Can I just specify which column to look for complete.cases in? – Lisarv Jul 18 '17 at 13:04
I've added a solution to your comment at the bottom of my answer. – CPak Jul 18 '17 at 13:29
It does not take group into account though? – Lisarv Jul 18 '17 at 13:48
Your output does not suggest a logic for how you want to account for groups. Could you explain? – CPak Jul 18 '17 at 13:49
"duplicates within groups (ID)" means that duplicates should only be removed within each group (ID) and not across all groups. Thus, only duplicates within 49V, then only duplictaes within 52V etc... – Lisarv Jul 18 '17 at 13:56
Just added `group_by(ID)` – CPak Jul 18 '17 at 14:04
Thanks! This answer is just as good as the one I accepted, if anyone else would wonder! – Lisarv Jul 18 '17 at 14:13

bouncyball · Accepted Answer · 2017-07-18T13:33:49.813

0

Here's a base solution, using the complete.cases function, and also creating a sorted feedID column:

# remove any rows with NA values
test <- test[complete.cases(test[,c('ID', 'feedID','feedID2')]),]
#remove any rows with feedID == feedID2
test <- test[!(test$feedID == test$feedID2),]
# add new feedID3 column
test$feedID3 <- apply(test, 1, function(x) paste(sort(c(x[2], x[3])), collapse = '-'))
# remove any duplicates, and remove last column
test[!duplicated(test[,c('feedID3', 'ID')]), -4]


   ID feedID feedID2
2 49V     A1      G2
6 52V     B1      D1
7 52V     D1      D2

data

Note that we have converted "NA" to NA, and we have also set stringsAsFactors = TRUE

test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2",  NA ),
                   stringsAsFactors = FALSE)

edited Jul 18 '17 at 13:33

answered Jul 18 '17 at 12:56

bouncyball

10,631
19
31

Same comment as for Chi Pak below, it does not take group (ID) into account. My example should have been clearer, but if you replace the D and B to A and G you will see what I mean. – Lisarv Jul 18 '17 at 13:50
I think you edited the question just after I copied the code, now it works perfectly, thanks! – Lisarv Jul 18 '17 at 13:55
Guys, you are always so impatient. ;) Needed to try it properly on my own data first, and it works! Wish I could accept two answers as the other one worked just as fine in the end, but yeah... – Lisarv Jul 18 '17 at 14:12

score 0 · Answer 3 · answered Jul 18 '17 at 13:18

After changing "NA" to NA, and setting stringsAsFactors = F

library(dplyr)
library(stringr)

test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2",  NA ),
                   stringsAsFactors = F)

desiredoutput <- data.frame(ID= c("49V", "52V", "52V"),
                            feedID = c("A1","B1", "D1" ),
                            feedID2 = c("G2", "D1", "D2" ),
                            stringsAsFactors = F)

test %>% 
  # Remove NAs and all rows where the IDs are equal
  filter(!is.na(feedID),                       
         !is.na(feedID2),                      
         feedID != feedID2) %>%                
  # Group rowwise and create a sorted pair of the two ID columns
  rowwise() %>%                                
  mutate(revCheck = str_c(str_sort(c(feedID, feedID2)), collapse = "")) %>% 
  ungroup() %>% 
  # Find distinct ID pairs and keep all variables
  distinct(revCheck,
           .keep_all = T) %>% 
  # Find distinct rows for each ID pair. I kept these separate because I
  # think that's what you're asking for in your example, you want all
  # duplicates in feedID and all duplicates in feedID2 removed, not just
  # duplicate combinations of feedID and feedID2. See .keep_all in ?distinct
  distinct(feedID,
           .keep_all = T) %>% 
  distinct(feedID2,
           .keep_all = T) %>% 
  # Remove the sorted pair id
  select(-revCheck) %>% 
  # Return a dataframe
  as.data.frame(.)

Removing duplicate all-way-combinations while retaining all columns

3 Answers3

data