I am trying to look at protein sequence homology using R, and I'd like to go through a data frame looking for identical pairs of Position and Letter. The data look similar to the frame below:

Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)

Which yields:

     Position Letter
[1,] "1"      "A"   
[2,] "2"      "B"   
[3,] "3"      "C"   
[4,] "4"      "D"   
[5,] "4"      "D"   
[6,] "5"      "E"   
[7,] "6"      "G"   
[8,] "7"      "L"   

I'd like to loop through and find all identical observations (in this case, observations 4 and 5), but I'm having difficulty in discovering the best way to do it.

I'd like the resultant data frame to look like:

     Position Letter
[1,] "4"      "D"   
[2,] "4"      "D"   

My attempts so far led to the code below, but it returns a single TRUE because, as I realized, I am comparing the whole object with itself:

> identical(data.set[1:nrow(data.set),1:2], data.set[1:nrow(data.set),1:2])
[1] TRUE

I'm not sure whether looping with the identical() function is the best approach. I'm sure there's a more elegant solution that I am missing.

Thanks for any help!

Phantom Photon

2 Answers

Try the unique function:

unique(data.set)

...

Karsten W.

You can call duplicated() twice, once with fromLast = TRUE, so that both the first and the later copy of each repeated row are flagged:

data.set[duplicated(data.set) | duplicated(data.set, fromLast = TRUE), ]

#     Position Letter
#[1,] "4"      "D"   
#[2,] "4"      "D"  
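
Since the question asks for a data frame as the result, here is a minimal sketch of the same duplicated() idea applied to a data frame instead of the character matrix that cbind() produces. The names data.df, is.dup and dup.rows are made up for this illustration, and building the data with data.frame() is an assumption about what the asker wants.

```r
# Rebuild the question's example as a data frame (assumption: the original
# uses cbind(), which yields a character matrix, not a data frame).
data.df <- data.frame(
  Position = c(1, 2, 3, 4, 4, 5, 6, 7),
  Letter   = c("A", "B", "C", "D", "D", "E", "G", "L")
)

# duplicated() alone flags only the later copies of a repeated row.
# Combining it with a second pass using fromLast = TRUE also flags the
# earlier copies, so every member of a duplicated group is marked.
is.dup <- duplicated(data.df) | duplicated(data.df, fromLast = TRUE)

# Keep only the rows that occur more than once.
dup.rows <- data.df[is.dup, ]
print(dup.rows)
```

With the example data this should keep rows 4 and 5, both with Position 4 and Letter "D". Position also stays numeric here, whereas cbind() coerces it to character.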
jalapic