Unique rows, considering two columns, in R, without order

Question

Unlike the questions I've found, I want the unique rows across two columns, ignoring the order of the values within each row.

I have a df:

df<-cbind(c("a","b","c","b"),c("b","d","e","a"))
> df
     [,1] [,2]
 [1,] "a"  "b" 
 [2,] "b"  "d" 
 [3,] "c"  "e" 
 [4,] "b"  "a"

In this case, row 1 and row 4 are "duplicates" in the sense that a-b is the same as b-a.

I know how to find the unique rows of columns 1 and 2 directly, but that approach treats every row here as unique.

That is not a data.frame but a matrix; if it were a df, `unique(df)` would do the trick. Try `df<-data.frame(c("a","b","c","b"),c("b","d","e","a"))` first. — Frank, Feb 18 '15 at 00:47
I don't think so, `unique(df)` doesn't check across columns to see that `c('a','b')` is effectively the same as `c('b','a')` (and why should it?). Slightly more work ... — r2evans, Feb 18 '15 at 00:52

A5C1D2H2I1M1N2O1R2T1 · Answer 1 · 2015-02-18T02:37:41.723

15

If it's just two columns, you can also use pmin and pmax, like this:

library(data.table)
unique(as.data.table(df)[, c("V1", "V2") := list(pmin(V1, V2),
                         pmax(V1, V2))], by = c("V1", "V2"))
#    V1 V2
# 1:  a  b
# 2:  b  d
# 3:  c  e
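The same pmin/pmax idea can be sketched in base R without any packages. This is only an illustration on the question's matrix; the names `canon` and `result` are made up for the example, and the comparison of strings depends on the locale's sort order.

```r
# Same example matrix as in the question
df <- cbind(c("a", "b", "c", "b"), c("b", "d", "e", "a"))

# Canonical form of each row: smaller value first, larger value second
canon <- cbind(pmin(df[, 1], df[, 2]), pmax(df[, 1], df[, 2]))

# Keep the first occurrence of each unordered pair,
# returning the rows in their original orientation
result <- df[!duplicated(canon), , drop = FALSE]
result
```

Here rows 1, 2 and 3 are kept and row 4 (b-a) is dropped as a repeat of row 1.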

A similar approach using "dplyr" might be:

library(dplyr)
data.frame(df, stringsAsFactors = FALSE) %>% 
  mutate(key = paste0(pmin(X1, X2), pmax(X1, X2))) %>% 
  distinct(key, .keep_all = TRUE)  # .keep_all keeps X1 and X2 in current dplyr
#   X1 X2 key
# 1  a  b  ab
# 2  b  d  bd
# 3  c  e  ce

edited Feb 18 '15 at 02:37

answered Feb 18 '15 at 02:05

A5C1D2H2I1M1N2O1R2T1

Why is `by = c("V1", "V2")` needed? It seems that omitting it gives the same result. – Dan Aug 12 '19 at 15:20

score 8 · Accepted Answer · answered Feb 18 '15 at 00:59

There are lots of ways to do this; here is one:

unique(t(apply(df, 1, sort)))
duplicated(t(apply(df, 1, sort)))

The first gives the unique rows (with each row sorted); the second gives a logical mask marking the repeated rows.
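As a rough illustration of how the two calls fit together on the question's matrix, the mask can also be used to index the original data so the kept rows are not reordered internally. The names `sorted_rows`, `uniq`, `mask` and `kept` are only for this sketch.

```r
df <- cbind(c("a", "b", "c", "b"), c("b", "d", "e", "a"))

# Sort within each row; apply() returns the rows as columns, so transpose back
sorted_rows <- t(apply(df, 1, sort))

uniq <- unique(sorted_rows)      # unique rows, in sorted form
mask <- duplicated(sorted_rows)  # TRUE for rows that repeat an earlier row
mask
# [1] FALSE FALSE FALSE  TRUE

# Index the original matrix with the mask to keep the original orientation
kept <- df[!mask, , drop = FALSE]
```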

answered Feb 18 '15 at 00:59

jimmyb

This approach returns the first unique occurrence of a row (rows 1,2,3) but it does not return the duplicate rows (rows 1,4)/unique rows (2,3) as defined by the original poster. – atreju Sep 01 '15 at 10:05
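If the goal is the split described in this comment, one possible sketch is to combine `duplicated()` with its `fromLast = TRUE` variant, so that every member of a repeated pair is flagged rather than only the later ones. The names `sorted_rows` and `in_pair` are only for this example.

```r
df <- cbind(c("a", "b", "c", "b"), c("b", "d", "e", "a"))
sorted_rows <- t(apply(df, 1, sort))

# Flag a row if it repeats an earlier row OR is repeated by a later row
in_pair <- duplicated(sorted_rows) | duplicated(sorted_rows, fromLast = TRUE)

which(in_pair)   # rows that have a counterpart: 1 and 4
which(!in_pair)  # rows with no counterpart: 2 and 3
```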

score 3 · Answer 3 · answered Feb 18 '15 at 02:44

You could use igraph to create an undirected graph and then convert it back to a data.frame:

unique(get.data.frame(graph.data.frame(df, directed=FALSE),"edges"))

answered Feb 18 '15 at 02:44

mnel

score 0 · Answer 4 · answered Feb 18 '15 at 00:59

If all of the elements are strings (heck, even if not and you can coerce them), then one trick is to create it as a data.frame and use some of dplyr's tricks on it.

library(dplyr)
df <- data.frame(v1 = c("a","b","c","b"), v2 = c("b","d","e","a"))
df$key <- apply(df, 1, function(s) paste0(sort(s), collapse=''))
head(df)
##   v1 v2 key
## 1  a  b  ab
## 2  b  d  bd
## 3  c  e  ce
## 4  b  a  ab

The $key column now identifies the repeats: rows 1 and 4 share the key ab.

df %>% group_by(key) %>% do(head(., n = 1))
## Source: local data frame [3 x 3]
## Groups: key
##   v1 v2 key
## 1  a  b  ab
## 2  b  d  bd
## 3  c  e  ce

answered Feb 18 '15 at 00:59

r2evans

This is not very good use of `dplyr`. I would suggest looking at `distinct` if you wanted to go this route. On a small (100k rows) dataset, this approach presently takes > 4 seconds on my system while the base R approach takes ~ 1.3 seconds and the data.table approach takes ~ 0.03 seconds. – A5C1D2H2I1M1N2O1R2T1 Feb 18 '15 at 02:28

Using `pmin` and `pmax` is where the speed comes in. A `dplyr` variant of my `data.table` answer runs at ~ 0.05 seconds. For reference, the variant I'm referring to looks like this: `data.frame(df, stringsAsFactors = FALSE) %>% mutate(key = paste0(pmin(X1, X2), pmax(X1, X2), sep = "")) %>% distinct(key)` – A5C1D2H2I1M1N2O1R2T1 Feb 18 '15 at 02:32
Your code is certainly impressive. I'm still learning the ins-and-outs of `dplyr`, which must seem obvious to you. – r2evans Feb 18 '15 at 04:39
