
I have a data frame in R that contains the gene ids of paralogous genes in Arabidopsis, looking something like this:

gene_x    gene_y
AT1       AT2
AT3       AT4
AT1       AT2
AT1       AT3
AT2       AT1

with the 'ATx' corresponding to the gene names.

For downstream analysis, I want to continue with only the unique pairs. Some rows are exact duplicates and can be removed easily with the duplicated() function. However, the fifth row in the artificial data frame above is also a duplicate, just in reversed order, and neither duplicated() nor unique() picks it up.
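
To make the problem concrete, here is a minimal sketch that rebuilds the example above (the name `example_pairs` is chosen here purely for illustration) and shows which rows duplicated() flags:

```r
# Rebuild the example data frame from the question.
example_pairs <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# duplicated() compares rows column by column, so only the exact
# repeat (row 3) is flagged; the reversed pair in row 5 is not.
flags <- duplicated(example_pairs)
print(flags)

# unique() therefore still keeps four rows, including AT2 / AT1.
print(unique(example_pairs))
```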

Any ideas on how to remove these rows?

tmfmnk
KoenVdB
  • Sort first, then find duplicates. I suggest you provide us with a reproducible example. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Luštrik Mar 31 '14 at 08:06

3 Answers

mydf <- read.table(text="gene_x    gene_y
AT1       AT2
AT3       AT4
AT1       AT2
AT1       AT3
AT2       AT1", header=TRUE, stringsAsFactors=FALSE)

Here's one strategy using apply, sort, paste, and duplicated:

mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3
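
One caveat with the key built above: pasting with an empty separator can make two different pairs look identical. The following is a small sketch with made-up ids (`ABC`, `DE`, `AB`, `CDE` are hypothetical, not real gene names) comparing an empty separator with `"_"`:

```r
# Two genuinely different pairs, using hypothetical ids.
toy_pairs <- data.frame(
  gene_x = c("ABC", "AB"),
  gene_y = c("DE", "CDE"),
  stringsAsFactors = FALSE
)

# With collapse = "" both rows collapse to the same string "ABCDE".
key_empty <- apply(toy_pairs, 1, function(x) paste(sort(x), collapse = ""))

# With a separator the keys stay distinct: "ABC_DE" and "AB_CDE".
key_sep <- apply(toy_pairs, 1, function(x) paste(sort(x), collapse = "_"))

print(any(duplicated(key_empty)))  # TRUE: a false duplicate
print(any(duplicated(key_sep)))    # FALSE
```

The separator only helps if it cannot occur inside an id, so pick a character that never appears in your gene names.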

And here's a slightly different solution:

mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3
Thomas
  • Be careful! By using `collapse=''` you may unintentionally delete some non-duplicated gene pairs. For example, an `ABC` - `DE` pair will be deleted if there is an `AB` - `CDE` pair (since both form `ABCDE` when pasted together). Changing `collapse=''` to `collapse='_'` avoids this. – smtnkc Nov 20 '18 at 08:11
  • How would you change the code to exclude also the first observation of a duplicate, i.e. to keep only rows 2 and 3 of the original data frame in the output? – sakwa Nov 20 '19 at 10:22
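
Regarding the last comment, one possible approach (a sketch, not part of the original answer) is to flag every row whose order-independent key occurs more than once, scanning from both ends with `duplicated(..., fromLast = TRUE)`. The variable names `key`, `repeated`, and `singletons` are illustrative:

```r
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Order-independent key per row, with a separator to avoid collisions.
key <- apply(mydf, 1, function(x) paste(sort(x), collapse = "_"))

# A key is "repeated" if it is a duplicate when scanning forward
# or backward, which also marks the first occurrence.
repeated <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Keep only pairs that occur exactly once (in either order).
singletons <- mydf[!repeated, ]
print(singletons)
```

For the example data this leaves the AT3 / AT4 and AT1 / AT3 rows, which are rows 2 and 4 of the original data frame.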

A dplyr possibility could be:

mydf %>%
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

  gene_x gene_y
  <chr>  <chr> 
1 AT1    AT2   
2 AT1    AT3   
3 AT3    AT4  
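
The same pmin()/pmax() idea also works without dplyr. The following is a base R sketch (the names `lo`, `hi`, and `deduped` are illustrative) that keeps the original row order instead of the grouped order shown above:

```r
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Element-wise smaller and larger id of each pair, so that
# AT2 / AT1 and AT1 / AT2 produce the same (lo, hi) combination.
lo <- pmin(mydf$gene_x, mydf$gene_y)
hi <- pmax(mydf$gene_x, mydf$gene_y)

# duplicated() on the two-column helper needs no pasted key,
# so no separator collisions can occur.
deduped <- mydf[!duplicated(data.frame(lo, hi)), ]
print(deduped)
```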

Or:

mydf %>%
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 filter(row_number() == 1) %>%
 ungroup() %>%
 select(-grp)

Or:

mydf %>%
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 distinct(grp, .keep_all = TRUE) %>%
 ungroup() %>%
 select(-grp)

Or using dplyr and purrr:

mydf %>%
 group_by(grp = paste(invoke(pmax, .), invoke(pmin, .), sep = "_")) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

As of purrr 0.3.0, invoke() is retired and exec() should be used instead:

mydf %>%
 group_by(grp = paste(exec(pmax, !!!.), exec(pmin, !!!.), sep = "_")) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

Or:

mydf %>%
 rowwise() %>%
 mutate(grp = paste(sort(c(gene_x, gene_y)), collapse = "_")) %>%
 group_by(grp) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)
tmfmnk

Another tidyverse-centric approach but using purrr:

library(tidyverse)

c_sort_collapse <- function(...){
  c(...) %>% 
    sort() %>% 
    str_c(collapse = ".")
}

mydf %>% 
  mutate(x_y = map2_chr(gene_x, gene_y, c_sort_collapse)) %>% 
  distinct(x_y, .keep_all = TRUE) %>% 
  select(-x_y)
#>   gene_x gene_y
#> 1    AT1    AT2
#> 2    AT3    AT4
#> 3    AT1    AT3
Bryan Shalloway