Subsetting Identical Observations in R

Question

I am trying to look at protein sequence homology using R, and I'd like to go through a data frame looking for identical pairs of Position and Letter. The data look similar to the frame below:

Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)

Which yields:

     Position Letter
[1,] "1"      "A"   
[2,] "2"      "B"   
[3,] "3"      "C"   
[4,] "4"      "D"   
[5,] "4"      "D"   
[6,] "5"      "E"   
[7,] "6"      "G"   
[8,] "7"      "L"

I'd like to loop through and find all identical observations (in this case, observations 4 and 5), but I'm having difficulty finding the best way to do it.

I'd like the resultant data frame to look like:

     Position Letter
[1,] "4"      "D"   
[2,] "4"      "D"

My attempts so far led to the code below. Unfortunately it returns a single TRUE, because I realized I am comparing the data frame with itself:

> identical(data.set[1:nrow(data.set),1:2], data.set[1:nrow(data.set),1:2])
[1] TRUE
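To illustrate why that call gives a single value: identical() compares two whole objects and returns one logical, so it can only find matching rows if it is applied to one pair of rows at a time. A minimal sketch on the example data (the variable names rows_4_5 and rows_3_4 are only illustrative):

```r
# Rebuild the example matrix from the question.
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)

# identical() returns exactly one TRUE/FALSE per pair of objects compared.
rows_4_5 <- identical(data.set[4, ], data.set[5, ])  # rows 4 and 5 match
rows_3_4 <- identical(data.set[3, ], data.set[4, ])  # rows 3 and 4 differ
```

Checking every pair this way would need a nested loop over the rows.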

I'm not sure whether looping through with the identical() function is the best way. I suspect there's a more elegant solution that I am missing.

Thanks for any help!

score 1 · Answer 1 · answered Jan 28 '15 at 20:06

Try the unique function:

unique(data.set)

...
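For reference, a small sketch of what unique() does on the example data. It keeps one copy of each distinct row, so it removes the repeat rather than selecting it. The names deduped and repeats are illustrative and not part of the original answer.

```r
# Rebuild the example matrix from the question.
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)

# unique() keeps the first copy of each distinct row and drops later repeats.
deduped <- unique(data.set)
nrow(data.set)  # 8 rows in the original
nrow(deduped)   # 7 rows once the repeated "4"/"D" row is dropped

# duplicated() flags only the later repeats, so this selects a single row:
# the second "4"/"D" observation, not both.
repeats <- data.set[duplicated(data.set), , drop = FALSE]
```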

answered Jan 28 '15 at 20:06

Karsten W.


Oh my goodness, this is a bit of a facepalm moment... Thank you! – Phantom Photon Jan 28 '15 at 20:26

score 0 · Accepted Answer · answered Jan 28 '15 at 20:09

You can call duplicated() twice, once with fromLast = TRUE, so that repeated rows are flagged from both directions:

data.set[duplicated(data.set) | duplicated(data.set, fromLast = TRUE), ]

#     Position Letter
#[1,] "4"      "D"   
#[2,] "4"      "D"
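A step-by-step sketch of the same idea may make the two directions clearer. The intermediate names (forward, backward, in_group, dups, df, df_dups) are assumptions added for illustration. The last lines show the same approach on a real data frame built with data.frame(), which keeps Position numeric instead of coercing it to character as cbind() does.

```r
# Rebuild the example matrix from the question.
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)

# Scanning top to bottom flags the later copy (row 5).
forward <- duplicated(data.set)
# Scanning bottom to top flags the earlier copy (row 4).
backward <- duplicated(data.set, fromLast = TRUE)

# A row belongs to a duplicate group if either scan flagged it.
in_group <- forward | backward
# drop = FALSE keeps the result a matrix whatever the number of matches.
dups <- data.set[in_group, , drop = FALSE]

# The same approach on a data frame; Position stays numeric here.
df <- data.frame(Position, Letter)
df_dups <- df[duplicated(df) | duplicated(df, fromLast = TRUE), ]
```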

answered Jan 28 '15 at 20:09

jalapic


Thank you, this was very valuable for me! – Phantom Photon Jan 28 '15 at 20:25
