How to match columns or strings in R regardless of order

Question

I am trying to find common edges between coexpression networks of genes. Here is a toy example:

Dataset 1    Dataset 2    Dataset 3
A:B          A:B          A:B
D:E          NA           D:E

So by intersecting these columns, A:B is an edge to be included, but not D:E.

My issue is that an edge can be written either way round: either A:B or B:A. I also have A and B as separate columns. So any one data frame will look something like this:

Gene1    Gene2    Edge
A        B        A:B

or this:

Gene1    Gene2    Edge
B        A        B:A

This means when trying to intersect you could get something like the following:

Dataset 1    Dataset 2    Dataset 3    Dataset 4    Dataset 5
B:A          A:B          A:B          B:A          A:B

Plain string matching wouldn't work, because "A:B" and "B:A" would be treated as different edges even though the relationship is the same.

How can I subset a data frame to find a gene pair regardless of the order of the genes, either by querying the string "gene1:gene2" or by using the Gene1 and Gene2 columns?

What is your expected output? Please make input and output reproducible: https://stackoverflow.com/q/5963269/4552295 — s_baldur, Sep 03 '18 at 13:45
Please be clear on **EXAMPLE DATA** and **DESIRED OUTCOME**. — Andre Elrico, Sep 03 '18 at 13:54

score 0 · Answer 1 · answered Sep 03 '18 at 13:49

I don't know if the following puts you close to what you need, but it does solve the problem of matching the strings.

Dataset1 <- data.frame(Edge = c("A:B", "D:E"))
Dataset2 <- data.frame(Edge = c("A:B", NA))
Dataset3 <- data.frame(Edge = c("A:B", "D:E"))

splitSort <- function(x, split = ":"){
  x <- as.character(x)
  x <- strsplit(x, split)
  x <- lapply(x, function(y) paste(sort(y), collapse = split))
  unlist(x)
}

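# Normalise every edge so that "B:A" and "A:B" both become "A:B"
e1 <- splitSort(Dataset1$Edge)
e2 <- splitSort(Dataset2$Edge)
e3 <- splitSort(Dataset3$Edge)
# Keep only the edges present in all three datasets
r <- Reduce(intersect, list(e1, e2, e3))

# Match on the normalised edges (e2), not the raw Edge column,
# so that a reversed pair such as "B:A" is still found
i <- which(e2 %in% r)
Dataset2[i, , drop = FALSE]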
#  Edge
#1  A:B
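
The question also mentions separate Gene1 and Gene2 columns. As a rough sketch of that route (not something the answer above does), an order-independent key can be built directly from the two columns with `pmin()` and `pmax()`, which avoids splitting strings. The data frames `net1` and `net2` and the helper `edgeKey` are made-up names for illustration only, and the sketch assumes neither gene column contains NA.

```r
# Hypothetical toy networks with Gene1/Gene2 columns, as described in the question
net1 <- data.frame(Gene1 = c("A", "D"), Gene2 = c("B", "E"), stringsAsFactors = FALSE)
net2 <- data.frame(Gene1 = c("B", "F"), Gene2 = c("A", "G"), stringsAsFactors = FALSE)

# Build an order-independent key: the alphabetically smaller gene always comes first
# (assumes no NA values in either column)
edgeKey <- function(g1, g2, sep = ":") {
  g1 <- as.character(g1)
  g2 <- as.character(g2)
  paste(pmin(g1, g2), pmax(g1, g2), sep = sep)
}

net1$Key <- edgeKey(net1$Gene1, net1$Gene2)
net2$Key <- edgeKey(net2$Gene1, net2$Gene2)

# Edges shared by both networks, regardless of the original gene order
common <- intersect(net1$Key, net2$Key)
net1[net1$Key %in% common, , drop = FALSE]
```

Because the key is computed column-wise, this stays vectorised, which may matter for large networks.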

score 0 · Answer 2 · answered Sep 03 '18 at 13:53

I have no clue what you want. Here is my try. Maybe it helps if you just order your genes the same way within each edge.

df1 <-
    structure(list(Dataset1 = c("B:A", "E:A"),
                   Dataset2 = c("A:B", "A:E"),
                   Dataset3 = c("A:B", "A:B"),
                   Dataset4 = c("B:A", "E:A"),
                   Dataset5 = c("A:B", "B:A")),
              row.names = c(NA, -2L), class = "data.frame")
#  Dataset1 Dataset2 Dataset3 Dataset4 Dataset5
#1      B:A      A:B      A:B      B:A      A:B
#2      E:A      A:E      A:B      E:A      B:A

library(magrittr)
fun1 <- function(x) {
    strsplit(x,":") %>% lapply(sort) %>% lapply(paste0,collapse=":") %>% unlist
}

df1[] %<>% lapply(fun1)

df1
#  Dataset1 Dataset2 Dataset3 Dataset4 Dataset5
#1      A:B      A:B      A:B      A:B      A:B
#2      A:E      A:E      A:B      A:E      A:B
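
Once every edge is written in the same order, ordinary set operations work again. The following is only a sketch of one possible next step, assuming each column holds the edge list of one dataset; the names `common` and `sameAcross` are made up here, and the normalised table is re-created so the snippet runs on its own.

```r
# Normalised table from above, re-created so this snippet is self-contained
df1 <- data.frame(
    Dataset1 = c("A:B", "A:E"),
    Dataset2 = c("A:B", "A:E"),
    Dataset3 = c("A:B", "A:B"),
    Dataset4 = c("A:B", "A:E"),
    Dataset5 = c("A:B", "A:B"),
    stringsAsFactors = FALSE
)

# Edges that occur in every dataset (each column treated as a set of edges)
common <- Reduce(intersect, as.list(df1))

# Alternatively, flag the rows in which all datasets hold the same edge
sameAcross <- apply(df1, 1, function(r) length(unique(r)) == 1)

common
sameAcross
```

Which of the two checks is appropriate depends on whether the rows are aligned across datasets; the set-based intersection does not rely on row position.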
