2

Many thanks for reading. Apologies for what I'm sure is a simple task.

I have a dataframe: (Edited: Added extra column not to be included in comparison)

b = c(5, 6, 7, 8, 10, 11) 
c = c('david','alan','pete', 'ben', 'richard', 'edd') 
d = c('alex','edd','ben','pete','raymond', 'alan')
df = data.frame(b, c, d) 
df
   b       c       d
1  5   david    alex
2  6    alan     edd
3  7    pete     ben
4  8     ben    pete
5 10 richard raymond
6 11     edd    alan

I want to compare the group of columns c and d with the group of columns d and c. That is, for one row, I want to compare the combined values in c and d with the combined values in d and c for all other rows.

(Note the values could either be characters or integers)

Where these match I want to return the index of those rows which match, preferably as a list of lists. I need to be able to access the indexes without referring to the values in column c or d.

I.e. for the above dataframe, my expected output would be:

c(c(2, 6), c(3, 4))
((2,6), (3,4))

As:

Row 2: (c + d == alan + edd) = row 6: (d + c == edd + alan)
Row 3: (c + d == pete + ben) = row 4: (d + c == ben + pete)

I understand how to determine the match case for two separate columns using match melt, but not if they are joined together and iterating over all possible row combinations.

I envision something like:

lapply(1:6, function(x), ifelse((df$a & df$b) == (df$b & df$a), index(x), 0))

But obviously that is incorrect and won't work.

I consulted the following questions but have been unable to formulate an answer. I have no idea where to begin.

Matching multiple columns on different data frames and getting other column as result

match two columns with two other columns

Comparing two columns in a data frame across many rows

R Comparing each value of all pairs of columns

How can I achieve the above?

Chuck
  • 3,664
  • 7
  • 42
  • 76

1 Answers1

4

You could do something like this. It splits the row indices 1:nrow(df) according to unique sorted strings formed from the columns of df. The sorting ensures that A,B and B,A are treated identically.

duplist <- split(1:nrow(df),apply(df,1,function(r) paste(sort(r),collapse=" ")))

duplist
$`alan edd`
[1] 2 6

$`alex david`
[1] 1

$`ben pete`
[1] 3 4

$`raymond richard`
[1] 5
Andrew Gustar
  • 17,295
  • 1
  • 22
  • 32
  • How to specify column names? I.e. if have 2 other columns with different names? – Chuck May 23 '17 at 11:09
  • This will work with any number of columns, provided df consists only of the columns of names. The names of the columns do not matter. Or you could just substitute `df[,c("e","f")]` for `df` if you only want to compare a particular pair of columns. – Andrew Gustar May 23 '17 at 11:15
  • Ideally I would like to keep only those pairs which have matches, not when only 1 row is returned. Is it possible to auto remove elements with length 1? – Chuck May 23 '17 at 11:20
  • In addition, how to access all of the numbers without referring to the individual list names? (By the way, thank you so much for the answer - it's like 95% of what I need and way ahead of anything I could have put together. Really appreciate the help) – Chuck May 23 '17 at 11:37
  • 1
    Use `duplist <- duplist[sapply(duplist,length)>1]` to just keep those with more than one match. You can also refer to the list elements by number, so `duplist[[1]]` is `2 6`. Also `names(duplist[1])` (note the single []) would be `"alan edd"`. – Andrew Gustar May 23 '17 at 11:49
  • 1
    Thank you. This will work. I really appreciate your answer and your help. Have a nice day :) – Chuck May 23 '17 at 12:36