Calculate pairwise spearman's rank correlation from data present in all files in a directory

Question

I'm trying to calculate Spearman's rank correlation, where the data (tsv with name and rank) for each experiment is stored in separate files in a directory.

Following is the format of input files:

#header not present
#geneName   value
ENSMUSG00000026179.14   14.5648627685587
ENSMUSG00000026179.14   0.652158034413075
ENSMUSG00000026179.14   0.652158034413075
ENSMUSG00000026179.14   1.852158034413075
ENSMUSG00000026176.13   4.13033421794948
ENSMUSG00000026176.13   4.13033421794948
ENSMUSG00000026176.13   15.4344068144428
ENSMUSG00000026176.13   15.4344068144428
ENSMUSG00000026176.13   6.9563523670728
...

My problem is that the keys(gene names) are repetitive, and each experiment file contains different but overlapping set of gene names. What I need is an intersection of gene names for each pair while performing the correlation and removing duplicates, probably something like this pseudo code:

# Find correlation for all possible pairs of input(i.e. files in directory)
files = list_Of_files("directory")
for(i in files) {
    for(k in files) {
    CommonGenes <- intersect (i,k)
    tempi <- removeRepetitive(i, CommonGenes) #Keep the gene with highest value and remove all other repeating genes. Also, keep only common genes.
    tempk <- removeRepetitive(k, CommonGenes) #Keep the gene with highest value and remove all other repeating genes. Also, keep only common genes. 
    correlationArray[] <- spearman(tempi, tempk) #Perform correlation for only the common genes
}
}

Ultimately, I want to plot the correlation matrix using corrplot or qtlcharts.

@ycw, I'm sorry. I usually work with python, so I find writing dummy examples in python "like" format easier. I will update my question to reflect that. — Siddharth, Jul 16 '17 at 23:02

zx8754 · Accepted Answer · 2017-07-17T07:01:06.397

First, read all the data into a list of dataframes, see this post for more info, here we are just creating a dummy data.

library(dplyr)

# dummy data
set.seed(1)
myDfs <- list(
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15)),
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15)),
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15)),
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15)),
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15))
)

Then, just like your two nested for loops, what we have here is two nested apply functions. Within loops we are aggregating and getting correlation on matched merged genes names.

res <- sapply(myDfs, function(i){
  # group by gene, get max value
  imax <- i %>% group_by(geneName) %>% summarise(i_Max = max(value))
  sapply(myDfs, function(j){
    # group by gene, get max value
    jmax <- j %>% group_by(geneName) %>% summarise(j_Max = max(value))
    # get overlapping genes
    ij <- merge(imax, jmax, by = "geneName")
    # return correlation
    cor(ij$i_Max, ij$j_Max, method = "spearman")
  })
})

res will have the correlation matrix.

res

#      [,1] [,2] [,3] [,4] [,5]
# [1,]  1.0 -0.2  1.0  0.4 -0.4
# [2,] -0.2  1.0 -0.2  0.8  0.0
# [3,]  1.0 -0.2  1.0  0.4 -0.4
# [4,]  0.4  0.8  0.4  1.0 -0.4
# [5,] -0.4  0.0 -0.4 -0.4  1.0

For correlation plot there are many alternatives to choose from. Here as an example we are using corrplot:

corrplot::corrplot(res)

Thanks! But I encountered another problem. The solution works, but I cannot define the labels based on file names. The current method disregards any sort of labels all together, but keeps the order of dataframes. I can use that information to label the rows and columns manually, but I was wondering if there is a way to do it automatically ? — Siddharth, Jul 17 '17 at 11:53
@Siddharth avoid asking new questions, I think you have already answered your own question anyway. Yes, keeps the order, you could add manually to the plot, or give a named object to corrplot. — zx8754, Jul 17 '17 at 12:15
sorry for that! I managed to that and modified the code to generate scatter-plot for each pair. Thanks for the help! — Siddharth, Jul 17 '17 at 12:27

score 1 · Answer 2 · answered Jul 20 '17 at 13:57

Here’s an alternative solution. Rather than having a nested loop, it uses expand.grid to create the combinations, and then uses a pipeline of ‹dplyr› verbs to calculate correlations on a subset of the master table.

This approach has both advantages and disadvantages. Foremost it fits nicely into the “tidy data” approach, and there are some who advocate to work in tidy data as much as possible. The actual code is about as long as zx8754’s.

library(dplyr)

genes = sprintf('ENSMUSG%011d', 1 : 50)
my_dfs = replicate(4, tibble(Gene = sample(genes, 20, replace = TRUE), Value = runif(20)),
                   simplify = FALSE)

First off we want to make the gene names unique because everything subsequently requires unique genes per table:

my_dfs = lapply(my_dfs, function (x) summarize(group_by(x, Gene), Value = max(Value)))

Now we can create all permutations of this list:

combinations = bind_cols(expand.grid(i = seq_along(my_dfs), j = seq_along(my_dfs)),
                         expand.grid(x = my_dfs, y = my_dfs))

At this point, we have a table with the indices of all pairwise combinations i, j, as well as the combinations themselves as list columns:

# A tibble: 16 x 4
       i     j                 x                 y
   <int> <int>            <list>            <list>
 1     1     1 <tibble [17 x 2]> <tibble [17 x 2]>
 2     2     1 <tibble [18 x 2]> <tibble [17 x 2]>
 3     3     1 <tibble [19 x 2]> <tibble [17 x 2]>
…

We now group by the indices and join the single list columns in each group by gene names:

correlations = combinations %>%
    group_by(i, j) %>%
    do(inner_join(.$x[[1]], .$y[[1]], by = 'Gene')) %>%
    print() %>%
    summarize(Cor = cor(Value.x, Value.y, method = 'spearman'))

Intermission: at the print() line we are left with a fully-expanded table of all pairwise combinations of all gene tables (the Value columns of the two original tables have been renamed into Value.x and Value.y, respectively):

# A tibble: 182 x 5
# Groups:   i, j [16]
       i     j               Gene    Value.x    Value.y
   <int> <int>              <chr>      <dbl>      <dbl>
 1     1     1 ENSMUSG00000000014 0.93470523 0.93470523
 2     1     1 ENSMUSG00000000019 0.21214252 0.21214252
 3     1     1 ENSMUSG00000000028 0.65167377 0.65167377
 4     1     1 ENSMUSG00000000043 0.12555510 0.12555510
 5     1     1 ENSMUSG00000000010 0.26722067 0.26722067
 6     1     1 ENSMUSG00000000041 0.38611409 0.38611409
 7     1     1 ENSMUSG00000000042 0.01339033 0.01339033
…

The next line trivially calculates pairwise correlations from these tables, using the same groups. Since the whole table is in long format, it can be conveniently plotted with ‹ggplot2›:

library(ggplot2)

ggplot(correlations) +
    aes(i, j, color = Cor) +
    geom_tile() +
    scale_color_gradient2()

… but if you need this as a square correlation matrix instead, nothing is easier:

corr_mat = with(correlations, matrix(Cor, nrow = max(i)))

      [,1]  [,2]  [,3]  [,4]
[1,]  1.00  1.00 -0.20 -0.26
[2,]  1.00  1.00 -0.43 -0.50
[3,] -0.20 -0.43  1.00 -0.90
[4,] -0.26 -0.50 -0.90  1.00

Calculate pairwise spearman's rank correlation from data present in all files in a directory

2 Answers2