I have two vectors that initially have the same length. The first contains protein modification sites, e.g. "E123". The second contains a unique code for the literature reference for each site. I need to go through these vectors and remove multiple references to the same site from the same paper. That is, if VectorOne[1] == VectorOne[2] && VectorTwo[1] == VectorTwo[2], I need to remove the duplicate. The problem is that when I use for loops to go through the data, I may change the lengths of the vectors, so the indices I'm using may no longer be correct.

As soon as I have removed a single element from the vectors, the upper bound I am looping to, length(primarySite), is too high and the code crashes.

Here is an example of the first 10 values from these two vectors:

primarySite[1:10]
 [1] ""     ""     "D248" "E241" "E242" "E241" "E242" "D244" "D244" "E241"
sitePMID[1:10]
 [1] 24641686 24055347 23955771 23955771 23955771 23955771 23955771 23955771 23955771 23955771

Desired Output:
primarySite[1:6]
 [1] ""     ""     "D248" "E241" "E242" "D244" 
sitePMID[1:6]
 [1] 24641686 24055347 23955771 23955771 23955771 23955771 


for(i in 1:length(primarySite)){
  for(j in (i+1):length(primarySite)){
    if(primarySite[i] == primarySite[j] && sitePMID[i] == sitePMID[j]){
      primarySite <- primarySite[-j]
      sitePMID <- sitePMID[-j]
    }
  }
}

1 Answer

This is easy if we put the vectors in a data frame:

data = data.frame(primarySite, sitePMID)
deduplicated_data = unique(data)
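
As a rough illustration, here is the same idea applied to the ten sample values from the question. The names primarySite_dedup and sitePMID_dedup are made up for this sketch, and stringsAsFactors = FALSE is only there so the sites stay character strings on older R versions:

```r
# Sample data copied from the question (first 10 elements of each vector)
primarySite <- c("", "", "D248", "E241", "E242",
                 "E241", "E242", "D244", "D244", "E241")
sitePMID <- c(24641686, 24055347, 23955771, 23955771, 23955771,
              23955771, 23955771, 23955771, 23955771, 23955771)

# Pair the vectors row by row so each (site, PMID) combination is one row
data <- data.frame(primarySite, sitePMID, stringsAsFactors = FALSE)

# unique() on a data frame keeps the first occurrence of each distinct row
deduplicated_data <- unique(data)

# Pull the columns back out as plain vectors (hypothetical names)
primarySite_dedup <- deduplicated_data$primarySite
sitePMID_dedup <- deduplicated_data$sitePMID

print(deduplicated_data)
```

Note that the two empty sites are both kept, because their PMIDs differ and so the rows are not duplicates of each other.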

You can find many other ways in the R-FAQ.
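
One such alternative, sketched here on the assumption that you want to keep working with the two vectors rather than a data frame, is to build a logical mask with duplicated() and subset both vectors with it. The name keep is just an illustrative choice:

```r
# Sample data copied from the question (first 10 elements of each vector)
primarySite <- c("", "", "D248", "E241", "E242",
                 "E241", "E242", "D244", "D244", "E241")
sitePMID <- c(24641686, 24055347, 23955771, 23955771, 23955771,
              23955771, 23955771, 23955771, 23955771, 23955771)

# duplicated() on a data frame flags every row that repeats an earlier row;
# negating it gives TRUE for the first occurrence of each (site, PMID) pair
keep <- !duplicated(data.frame(primarySite, sitePMID))

# Subset both vectors with the same mask so they stay aligned
primarySite <- primarySite[keep]
sitePMID <- sitePMID[keep]
```

This also sidesteps the problem with the original loop. A sequence such as 1:length(primarySite) is evaluated once when the loop starts, so after elements are removed the later indices point past the end of the vector, where indexing returns NA, and if() stops with an error when its condition is NA. In addition, when i reaches the last element, (i+1):length(primarySite) counts downward instead of being empty.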

Gregor Thomas