I am new to programming and have just started learning R, so please bear with me. I am currently working with data in the following format, for example:

Disease      Gene Symbol
Disease A    FOXJ1
Disease B    MYB
Disease B    GATA4
Disease C    MYB
Disease D    GATA4

There are some 250 such entries. I would like to see the data in the following format:

Disease 1    Common Shared Gene Symbols    Disease 2
Disease A    MYB,FOXJ1                     Disease B
Disease C    MYB                           Disease B
Disease B    GATA4                         Disease D

My approach so far splits the process into three steps:

Step 1: Make pairwise combinations of the Diseases.

Step 2: Find gene symbols that are associated with each Disease and assign them to a vector.

Step 3: Use the intersect() function on these vectors to find the shared gene symbols.

I am sure there must be something much simpler than this.

Any help will be appreciated! Thank you very much!

Regards, S

DataStudent
Welcome to Stack Overflow. Please try to make a reproducible example of your situation. You can read http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Chargaff Oct 24 '13 at 19:29

1 Answer

One solution, using the combinat package:

library(combinat)

#random data
DF <- data.frame(Disease = LETTERS[1:10], Gene = sample(letters[1:4], 10, replace = TRUE))

#> DF
#   Disease Gene
#1        A    a
#2        B    a
#3        C    c
#4        D    b
#5        E    d
#6        F    b
#7        G    c
#8        H    d
#9        I    b
#10       J    d

#all possible combinations of diseases
#unique() keeps each disease once, even if it appears in several rows
dis_combns <- combn(unique(as.character(DF$Disease)), 2)  #see `?combn`

#find common genes between each pair of diseases
commons <- apply(dis_combns, 2, 
       function(x) intersect(DF$Gene[DF$Disease == x[1]], DF$Gene[DF$Disease == x[2]])) 
#format the list of common genes for easier manipulation later
#(pairs with no common gene become an empty string)
commons <- unlist(lapply(commons, paste, collapse = " and "))

#result
resultDF <- data.frame(Disease1 = dis_combns[1,], 
                     Common_genes = commons, Disease2 = dis_combns[2,])

#> resultDF
#   Disease1 Common_genes Disease2
#1         A            a        B
#2         A                     C
#3         A                     D
#4         A                     E
#5         A                     F
#6         A                     G
#7         A                     H
#8         A                     I
#9         A                     J
#10        B                     C
#11        B                     D
#12        B                     E
#13        B                     F
#14        B                     G
#....
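
As a rough sketch, here is the same idea applied to the five rows shown in the question, where a disease can have more than one gene. It follows the three steps from the question, using `split()` to collect the genes for each disease and the `combn()` that ships with base R (in `utils`), so no extra package is assumed. The names `genes_by_disease`, `pairs`, `shared` and `result` are only illustrative, and dropping pairs with nothing in common is an assumption about the desired output.

```r
# Data from the question (only the five rows shown there)
DF <- data.frame(
  Disease = c("Disease A", "Disease B", "Disease B", "Disease C", "Disease D"),
  Gene    = c("FOXJ1", "MYB", "GATA4", "MYB", "GATA4"),
  stringsAsFactors = FALSE
)

# Step 2 of the question: one character vector of genes per disease
genes_by_disease <- split(DF$Gene, DF$Disease)

# Step 1: every unordered pair of distinct diseases
pairs <- combn(names(genes_by_disease), 2)

# Step 3: intersect the two gene vectors of each pair
shared <- lapply(seq_len(ncol(pairs)), function(i) {
  intersect(genes_by_disease[[pairs[1, i]]], genes_by_disease[[pairs[2, i]]])
})

result <- data.frame(
  Disease1     = pairs[1, ],
  Common_genes = vapply(shared, paste, character(1), collapse = ","),
  Disease2     = pairs[2, ],
  stringsAsFactors = FALSE
)

# Assumption: only pairs that actually share a gene are of interest
result <- result[lengths(shared) > 0, ]
rownames(result) <- NULL
print(result)
```

For these five rows, the only pairs left are Disease B / Disease C (sharing MYB) and Disease B / Disease D (sharing GATA4). With about 250 entries the number of pairs is still small enough for this loop-style approach to be practical.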
alexis_laz