Subset data frame based on unique combination of multiple conditions

Question

I cant seem to find an answer through search to this on SO. I'm trying to select a subset of a data.frame based on four conditions (lon1, lon2, lat1 and lat2). I have a huge dissimilarity matrix that has been vectorized and the sites (lon1, lon2, lat1 and lat2) cbind to it. Here is an example data frame:

out1 <- data.frame(lon1 = sample(1:10), lon2 = sample(1:10), 
                   lat1 = sample(1:10), lat2 = sample(1:10), 
                   dissimilarity = sample(seq(0,1,.1),10))
> out1
     lon1   lon2    lat1 lat2 dissimilarity
1     2      6      4      4           0.6
2     4      2      1      3           1.0
3    10      9      2      6           0.0
4     3      1     10      8           0.5
5     9      5      9      1           0.8
6     5      7      5      9           0.9
7     1      8      6      7           0.2
8     8      3      8      5           0.7
9     7      4      3     10           0.3
10    6     10      7      2           0.1

out2 <- out1[c(2,5,6,8),]

   lon1 lon2 lat1 lat2 dissimilarity
1     4   2   1      3           1.0
2     9   5   9      1           0.8
3     5   7   5      9           0.9
4     8   3   8      5           0.7

I tried using %in% function a few times in this manner:

test <- out1[(out1$lon1 %in% out2$lon1) & (out1$lon2 %in% out2$lon2) & 
             (out1$lat1 %in% out2$lat1) & (out1$lat2 %in% out2$lat2), ]

This seems to work for the basic example I provide here. But, when I apply it to my huge data frame (with many lat and lons repeated) I get back a larger subset than the unqiue combinations I require. I assume because the match function in %in% can only match a vector. So it's matching condition1 & condition2 & condition3 & condition4 And thus is returning a results that gives a subset which is the same as the orginal out1. I want to get only the case when all four values are the same for that row. This way I'll get a subset of the data for the pairwise dissimilarities I'm interested in.

Any ideas on how to subset by row based on a unique combination of four variables would be greatly appreciated.

can you be more specific, I dont understand because all the 4 conditions are same in your case. Is `test` your expected output? — Nishanth, May 13 '13 at 09:19
@e4e5f4, he wants the whole row to be matched. by doing individual comparisons and &ing them, you get all possible combinations, not necessarily the same identical elements in a row. It's not different from [**this question (to which we both answered)**](http://stackoverflow.com/questions/16398420/subset-of-a-data-frame-in-r) — Arun, May 13 '13 at 09:30

Arun · Accepted Answer · 2013-05-13T09:18:02.630

2

I think this is what you're looking for. Basically you want duplicated function that returns what you're expecting.

out1[duplicated(rbind(out2, out1)[, 1:4])[-seq_len(nrow(out2))], ]

How does it work? First we rbind out2 and out1. Then call duplicated on it. The columns that are in out2 and in out1 will be marked as TRUE in out1. This is because the first occurrence is on out2 and it was not duplicated there. But the second time it finds the entry, it will be in out1 and so it'll know there has been a row exactly like this before. So, it'll mark it as duplicated. We now have all duplicated entries. From this we subset only the elements of out1 by removing the first n elements where n = nrow(out1). Then we subset using this logical vector on out1.

You can go through this explanation and run the code step by step to follow-up. Here's a break-down version for working out the logic.

tt <- rbind(out2, out1)
tt.dup <- duplicated(tt[, 1:4)] # marks all duplicate rows in out1 from 1st 4 cols
tt.dup <- tt.dup[-seq_len(nrow(out2))] # remove all out2 entries (first n)
out1[tt.dup, ] # index only TRUE/duplicated elements from out1

edited May 13 '13 at 09:18

answered May 13 '13 at 09:12

Arun

116,683
26
284
387

Hi Guys, Thanks for the prompt answer. This looks really promising, but I'm still having a few issues. I'm trying to extract dissimilarities from 400 sites (dist = (400*399)/2 = 79800 rows), however I only get 38528 rows returned using the above method. Any ideas on why this might be happening? Does duplicate handle negatives? Kind regards, Skip. – Skiptoniam May 13 '13 at 22:58
Skiptoniam, if you provide a small reproducible example of where this code goes not as expected by editing your post, I'd be glad to help. – Arun May 13 '13 at 23:11
Hi Arun, I've figured it out. The code wasn't working because I created the subset of dissimilarities on a different row order. Once I ordered the dissimilarities by latitude (rather than longitude) the unique combinations matched and I got 79800 rows returned. Thank you for all your help. – Skiptoniam May 13 '13 at 23:48

Subset data frame based on unique combination of multiple conditions

1 Answers1