I am not sure where to start here and could use some pointers.
I have several objects that are character vectors of different lengths, each containing gene names. I want to compare all objects pairwise and get the number of genes shared between each pair (using, for instance, intersect()). I would like to store all the pairwise counts in a matrix to make a heatmap.
However, I am not sure how best to perform the comparisons or how to store the results. Should I group all the objects into a data frame first?

I have 24 objects called names_something:

> length(names_G63)
[1] 4518
> head(names_G63)
[1] "SARC_00002" "SARC_00004" "SARC_00005" "SARC_00012" "SARC_00022" "SARC_00025"
> length(names_C28)
[1] 9190
> head(names_C28)
[1] "SARC_00001" "SARC_00002" "SARC_00003" "SARC_00004" "SARC_00005" "SARC_00008"

And the comparisons would give a single number showing the number of shared genes between lists:

> length(intersect(names_G63, names_C28))
[1] 4097

I want to store these numbers as a matrix, like:

      G63 C28 B124
G63     0
C28  4097   0
B124 3000 345    0
Jon
  • Take a look at `cor()` – Stedy Dec 21 '16 at 19:59
  • In order for us to be able to answer your question, please include a sample of your data by typing `dput(variableName)` for each variable (or a representative subset of the variables) and copying and pasting the console output into your question. Also include the desired output for the data you provide. For more information on how to make a reproducible example in `R` (and make it more likely your question is answered) please view [this post](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Barker Dec 21 '16 at 19:59
  • If you have two vectors containing strings, vec1 and vec2, then `expand.grid(vec1, vec2)` will produce a data.frame with all pairwise combinations as rows. As an example `expand.grid(letters[1:4], LETTERS[1:5])`. – lmo Dec 21 '16 at 20:02
  • Sorry, I tried to add more details. – Jon Dec 21 '16 at 20:23

1 Answer

I think you are looking for something like this: a matrix that tells you how many genes are shared between each pair of experiments/sets.

#First, a vector of all genes: 1500 random five-letter names
genes <- unlist(lapply(1:1500, function(x) paste(sample(LETTERS, 5, replace = TRUE), collapse = "")))

#Now five pseudo-experiments, each yielding a set of 100 random genes from the vector above
geneList <- lapply(1:5, function(x) sample(genes, 100))

#Turn the list of gene sets into a long table with one row per (gene, experiment) pair
genedf <- stack(setNames(geneList, nm = paste("Expt", seq_along(geneList))))

#table() builds an experiment x gene count matrix; multiplying it by its transpose counts the overlaps
table(genedf[2:1]) %*% t(table(genedf[2:1]))

#         ind
# ind      Expt 1 Expt 2 Expt 3 Expt 4 Expt 5
#   Expt 1    100      8      5      7      7
#   Expt 2      8    100      5      5     10
#   Expt 3      5      5    100      8      4
#   Expt 4      7      5      8    100      8
#   Expt 5      7     10      4      8    100
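
For comparison, the same counts can be obtained by calling intersect() on every pair directly, as in the question. This is only a minimal sketch: the three gene sets below are made-up toy data, and `overlap` is an illustrative name. Unlike the cross product, which assumes each gene appears at most once per set, intersect() ignores duplicates within a set.

```r
# Toy gene sets (hypothetical data, for illustration only)
geneList <- list(
  G63  = c("SARC_00002", "SARC_00004", "SARC_00005"),
  C28  = c("SARC_00001", "SARC_00002", "SARC_00004", "SARC_00005"),
  B124 = c("SARC_00001", "SARC_00009")
)

# For every pair of sets, count the genes they share.
# The inner sapply() fills one column; the outer sapply() binds the columns.
overlap <- sapply(geneList, function(a)
  sapply(geneList, function(b) length(intersect(a, b))))

# Symmetric matrix; the diagonal holds each set's own size
overlap
```

With 24 sets this runs 576 intersect() calls, which should be fine at the sizes mentioned in the question; the cross product is the more compact option.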

Edit: to build the list of gene sets from your names_* objects:

#Collect every object whose name starts with "names_" into a named list
geneList <- mget(ls(pattern = "^names_"))
genedf <- stack(geneList)
table(genedf[2:1]) %*% t(table(genedf[2:1]))
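
Since the goal in the question is a heatmap with zeros on the diagonal, one possible follow-up is sketched below. The small `overlap` matrix is hypothetical stand-in data, not real results; in practice it would be the matrix returned by the cross product.

```r
# Hypothetical overlap matrix standing in for the cross-product result
sets <- c("G63", "C28", "B124")
overlap <- matrix(c(100,   8,   5,
                      8, 100,   5,
                      5,   5, 100),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(sets, sets))

# Zero the diagonal so each set's own size does not dominate the colour scale
diag(overlap) <- 0

# Base R heatmap; Rowv = NA and Colv = NA keep the given order (no clustering),
# and scale = "none" plots the raw counts
heatmap(overlap, Rowv = NA, Colv = NA, scale = "none")
```

Dropping `Rowv = NA` and `Colv = NA` lets heatmap() cluster the sets, which may be more informative with all 24 lists.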
emilliman5