0

I'm having a blind spot with this particular problem for some reason.

Sample data:

         date id1 id2
   1888-09-08   A   E
   1888-09-15   B   M
   1888-09-22   C   L
   1888-09-22   D   B
   1888-09-29   A   R
   1888-10-27   F   G
   1888-11-17   J   H
   1888-11-17   E   O
   1888-11-24   H   F
   1888-12-15   G   L
   1889-01-19   K   G
   1889-02-09   M   E

What I am trying to do is create a 4th variable that will start at 1 and increase by 1 at the observation whenever a value from either 'id1' or 'id2' is repeated. It will increase by 1 again whenever a value is duplicated starting from the observation when it last increased.

Expected results:

         date id1 id2 value
   1888-09-08   A   E     1
   1888-09-15   B   M     1
   1888-09-22   C   L     1
   1888-09-22   D   B     2
   1888-09-29   A   R     2
   1888-10-27   F   G     2
   1888-11-17   J   H     2
   1888-11-17   E   O     2
   1888-11-24   H   F     3
   1888-12-15   G   L     3
   1889-01-19   K   G     4
   1889-02-09   M   E     4

As you can see, on the fourth row, the id "B" repeated as it had been seen before. Therefore the value variable increased to '2'. Starting with this row, we then search for the next duplicated 'id'. That happens on row 9, where a "H" repeats (it had been seen in a previous observation since the last 'value' change). The value then changes to 3, and next changes on the next row but one as a 'G' repeats.

I hope this makes sense.

I haven't got very far. My main idea consists of turning the two columns into a long vector and then find the first duplicated value. From this, we could work out that all elements that belonged to different rows would get a '1' whereas this element '8' and it's partner '7' would get a 2. We could go from the 7th element (the first in the pair) forward looking for more duplicates...

temp <- as.matrix(c(rbind(as.character(dfr[,2]), as.character(dfr[,3])))) #put two columns into single vector

min(which(duplicated(temp))) #8

I'm not sure how effective this will be.

jalapic
  • 13,792
  • 8
  • 57
  • 87
  • does this http://stackoverflow.com/questions/22893028/tag-all-duplicate-rows-in-r-as-in-stata help? – bjoseph Mar 01 '15 at 02:46

1 Answers1

2

This solution isn't going to be fast but it works.

x<-sample(LETTERS[1:5],40,replace=TRUE)
y<-sample(LETTERS[1:5],40,replace=TRUE)
df<-data.frame(x=x,y=y,value=1)
df$x<-as.character(df$x)
df$y<-as.character(df$y)

vec<-c(df[1,c('x','y')])
for(i in 2:nrow(df)){

    if(df[i,'x'] %in% vec | df[i,'y'] %in% vec) {  #checking whether in vector

        print('in') #printing when duplicates are found
        df[i,'value']=df[i-1,'value']+1 #adding to the value
        vec<-c(df[i,c('x','y')]) #resetting the vector

    } else {

        df[i,'value']=df[i-1,'value'] 
        vec<-c(vec,df[i,c('x','y')])   #if not found i add to the vector 

    }

}
Jason
  • 1,559
  • 1
  • 9
  • 14
  • On further review, I'm finding that this isn't working...e.g. set.seed(19) - the value column is not correct after the procedure. i'm wondering if you can repeat the error? – jalapic Mar 01 '15 at 20:31
  • Yes, it wasn't picking up the first row both at the start and after finding a duplicate. I changed the initial setup of 'vec' to account for that. Does it work now? – Jason Mar 01 '15 at 20:37
  • Yes, great - I think that was exactly it - the first row - thanks. – jalapic Mar 01 '15 at 20:41
  • Sure thing. Make sure you pick up both edits. I edited the initial setting of vec and when we reset upon finding a duplicate since it was a bigger mistake than I initially thought... – Jason Mar 01 '15 at 20:47