7

I am having trouble subsetting a data.frame based on reciprocity of values in row and column.

Here is a example df to illustrate the problem:

rater <- c(21, 23, 26, 24)
ratee <- c(24, 21, 23, 21)
rating.data <- data.frame(rater, ratee)

Output:

   rater ratee
1    21    24
2    23    21
3    26    23
4    24    21

I would like to subset this df by only keeping the rows that have reciprocal values.

The resulting subset should look like this:

   rater ratee
1    21    24
4    24    21

Any thoughts would be much appreciated!

SeekingData
  • 115
  • 6
  • You using the word reciprocal wrongly. 1/x is the reciprocal value of x. – Andre Elrico Sep 05 '17 at 15:36
  • do you want to end up with a symmetric matrix??? – Andre Elrico Sep 05 '17 at 15:37
  • https://stackoverflow.com/questions/28574006/unique-rows-considering-two-columns-in-r-without-order also check the link – BENY Sep 05 '17 at 15:46
  • 1
    I have a matrix from using the igraph package, an adjacency matrix. That package can only provide a ratio of the reciprocal values. I am using the language they used for network analysis. – SeekingData Sep 05 '17 at 15:47

5 Answers5

9

We could sort by row and then use duplicated

m1 <- t(apply(rating.data, 1, sort))
rating.data[duplicated(m1)|duplicated(m1, fromLast = TRUE),]
#   rater ratee
#1    21    24
#4    24    21
akrun
  • 874,273
  • 37
  • 540
  • 662
6

Another possibility:

library(dplyr)
rating.data %>% inner_join(.,.,by=c("rater" = "ratee","ratee"="rater"))

Or this, for some reason it's twice faster on your small example (though slower than akrun's solution):

merge(rating.data,setNames(rating.data,rev(names(rating.data))))

to keep the second solution flexible with your additional columns:

merge(rating.data,setNames(rating.data[,c("rater","ratee")],c("ratee","rater")))
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • Thanks! This is actually what I was looking for as I did not want to loose the rest of the df. This allows me to specify the columns I have in mind. Thank you for reading my mind! – SeekingData Sep 05 '17 at 15:50
  • You're welcome, akrun's solution could easily be adapted though, if you just put `rating.data[,c("rater","ratee")]` in the first line, it might be faster too if it's a concern. – moodymudskipper Sep 05 '17 at 15:57
2
library(data.table)
N=10#number of rows 
dt1<-data.table(a=1:N,b=sample(N))#create the data.table that holds the info

dt1[,d:=ifelse(a<b,paste0(a,"_",b),paste0(b,"_",a))]#create unique key per pair  respecting the rule "min_max"
setkey(dt1,d)#setting the key 
dt1[dt1[,.N,d][N!=1],.(a,b)] #keep only the pairs that appear more than once
amonk
  • 1,769
  • 2
  • 18
  • 27
2

You can also use pmin and pmax to assist with grouping and then filter on all groups having more than one entry, i.e.

library(dplyr)

df %>% 
 group_by(grp = paste0(pmin(rater, ratee), pmax(rater, ratee))) %>% 
 filter(n() > 1) %>% 
 ungroup() %>% 
 select(-grp)

which gives,

# A tibble: 2 x 2
  rater ratee
  <dbl> <dbl>
1    21    24
2    24    21
Sotos
  • 51,121
  • 6
  • 32
  • 66
1

Similar in spirit to akrun's method, rbind the data.frame to a copy with the columns reversed. find duplicates, starting from the bottom, which will then return TRUE for the rows in the original data.frame. Subset this logical vector with [seq_len(nrow(rating.data))] to return a vector of the appropriate length to [ in order to subset on the rows.

rating.data[duplicated(rbind(rating.data,
                             unname(unclass(rating.data[2:1]))),
                       fromLast=TRUE)[seq_len(nrow(rating.data))],]
  rater ratee
1    21    24
4    24    21
lmo
  • 37,904
  • 9
  • 56
  • 69