1

I am trying to find a faster alternative to comparing each observation i with observation j within data frame X. For example, running the following code

for (i in 1:nrow(X)) {
  for (j in 1:nrow(X)) {
    if ((sum(c(X$Feature1[i], X$Feature1[j])) == 0) &&
        ((X$Feature2[i] == X$Feature2[j]) | (X$Feature3[i] == X$Feature3[j]))) {
      X$match[i] <- 1
    }
  }
}

takes quite a while with 20,000 or so observations. Is anyone aware of a sorting/comparison algorithm in R that would speed this up? Thanks in advance for your time!

pestopasta
    Please provide [example data](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) in order to make your issue reproducible! – jay.sf Jul 12 '18 at 17:59
  • 1
    there's probably a good solution with `outer` ... – Ben Bolker Jul 12 '18 at 18:50

1 Answer

2

You can do this kind of thing pretty easily in SQL, or in R with `sqldf`.

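library(sqldf)

# Temporary row id, so the self-join can be grouped back to one row per observation
X$match <- seq(nrow(X))

# Left join X to itself on the matching condition; a row has a match
# if at least one joined b row is non-null
X$match <- sqldf("
  select    sum(b.Feature1 is not null) > 0 as match
  from      X a 
            left join X b
              on  a.Feature1 + b.Feature1 = 0
                  and (
                  a.Feature2 = b.Feature2
                  or a.Feature3 = b.Feature3)
  group by  a.match
  order by  a.match
  ")[[1]]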

A base R version could be

X$match <- as.numeric(
            sapply(seq(nrow(X)), function(i){
                    any( (X$Feature1[i] + X$Feature1 == 0)
                         & (
                           (X$Feature2[i] == X$Feature2)
                           | (X$Feature3[i] == X$Feature3)))}))
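
The `outer` idea mentioned in the comments could look roughly like the sketch below. The toy data frame is hypothetical (the question did not include example data), and the names `sum_zero`, `same_f2` and `same_f3` are only for illustration. Note that each `outer` call builds a full n x n logical matrix, so with 20,000 rows each matrix would need roughly 1.6 GB; this is probably only practical for smaller inputs or when processed in chunks.

```r
# Hypothetical toy data; not from the original question.
X <- data.frame(
  Feature1 = c(1, -1, 2, 3, -3),
  Feature2 = c("a", "a", "b", "c", "d"),
  Feature3 = c("x", "y", "z", "w", "v"),
  stringsAsFactors = FALSE
)

# Pairwise logical matrices: entry [i, j] compares row i with row j.
sum_zero <- outer(X$Feature1, X$Feature1, "+") == 0
same_f2  <- outer(X$Feature2, X$Feature2, "==")
same_f3  <- outer(X$Feature3, X$Feature3, "==")

# Row i gets 1 if any row j satisfies the combined condition, else 0.
X$match <- as.numeric(rowSums(sum_zero & (same_f2 | same_f3)) > 0)
```

On this toy data only the first two rows pair up (their `Feature1` values sum to zero and they share `Feature2`), which is the same result the `sapply` version above gives.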
IceCreamToucan