2

Similar to this question, I have a data frame and would like to extract the rows that are not unique in their combination of values in several specific columns.

E.g., I have a data frame df:

> df<-data.frame(c(1,2,3,4),c(T,F,T,T),c("a","b","c","b"),c("b","d","e","a"))
> df
     [,1] [,2]    [,3] [,4]
[1,] "1"  "TRUE"  "a"  "b" 
[2,] "2"  "FALSE" "b"  "d" 
[3,] "3"  "TRUE"  "c"  "e" 
[4,] "4"  "TRUE"  "b"  "a" 

I would like to test whether the combination of values in coumn 2, 3 and 4 is unique or duplicate for the rows of the data frame. However, I don't want to classify the first occurence of a combination as unique and all subsequent combinations as duplicates but rather all occurences of non-unique combinations as duplicates.

In this example, rows 1 and 4 are duplicates and rows 2 and 3 are unique in their combinations of values in columns 2, 3 and 4.

Any help would be much appreciated.

Community
  • 1
  • 1
atreju
  • 965
  • 6
  • 15
  • 36
  • Your data.frame is a matrix. Use `data.frame` instead of `cbind` to create a data.frame. – akrun Sep 01 '15 at 10:31

2 Answers2

1

An alternative approach using dplyr. It uses all combinations of rows and checks if all the elements of one row belong to another row. In the end instead of True/False you get a number for each row that represents how many times it was found to match with other rows. It will be more obvious if you run the script step by step.

df<-data.frame(x1 =c(1,2,3,4),
               x2 = c(T,F,T,T),
               x3 = c("a","b","c","b"),
               x4 = c("b","d","e","a"), stringsAsFactors = F)

library(dplyr)


df %>%                                                                                
  rowwise() %>%                                                                     # for each row
  do(data.frame(., df2=df, stringsAsFactors=F)) %>%                                 # combine each row with all rows of dataset
  filter(x1 != df2.x1) %>%                                                          # exclude cases of self combinations
  rowwise() %>%                                                                     # for each row combination
  mutate(match = 
           ifelse(sum(c(x2,x3,x4) %in% c(df2.x2, df2.x3, df2.x4))==3, 1, 0)) %>%    # flag a match when all 3 elements of one row belong to the elements of the other row 
  group_by(x1,x2,x3,x4) %>%                                                         # group by rows of initial dataset
  summarise(sum_match = sum(match)) %>%                                             # calculate how many times they match with other rows
  ungroup


#   x1    x2 x3 x4 sum_match
# 1  1  TRUE  a  b         1
# 2  2 FALSE  b  d         0
# 3  3  TRUE  c  e         0
# 4  4  TRUE  b  a         1
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
0

We sort the columns 3:4 in the 'df' by row using apply with MARGIN=1, transpose (t) it and assign the output to the corresponding columns. To get the logical index of all the duplicates, we can apply duplicated in the default direction and in reverse direction using fromLast=TRUE. Here, I assumed that the first column will not be used for considering the duplicate elements.

df[3:4] <- t(apply(df[3:4], 1, sort))
duplicated(df[-1])|duplicated(df[-1], fromLast=TRUE)
#[1]  TRUE FALSE FALSE  TRUE
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks, it works, though I had to get rid of reassigning the columns 3 and 4 in order to keep my data frame in its current form. – atreju Sep 01 '15 at 11:07
  • 1
    @atreju In that case, you can create a new dataframe and then do the duplicated `dfN <- cbind(df[2],t(apply(df[3:4], 1, sort))); duplicated(dfN)|duplicated(dfN, fromLast=TRUE)` – akrun Sep 01 '15 at 11:09