I'm trying to calculate Spearman's rank correlation between experiments. The data for each experiment (a TSV with name and rank) are stored in a separate file, and all files are in one directory.
The input files have the following format:
#header not present
#geneName value
ENSMUSG00000026179.14 14.5648627685587
ENSMUSG00000026179.14 0.652158034413075
ENSMUSG00000026179.14 0.652158034413075
ENSMUSG00000026179.14 1.852158034413075
ENSMUSG00000026176.13 4.13033421794948
ENSMUSG00000026176.13 4.13033421794948
ENSMUSG00000026176.13 15.4344068144428
ENSMUSG00000026176.13 15.4344068144428
ENSMUSG00000026176.13 6.9563523670728
...
My problem is that the keys (gene names) are repeated within a file, and each experiment file contains a different but overlapping set of gene names. For each pair of files, I need to remove the duplicates and keep only the intersection of gene names before performing the correlation, probably something like this pseudocode:
# Find the correlation for all possible pairs of inputs (i.e. files in the directory)
files = list_Of_files("directory")
for (i in files) {
  for (k in files) {
    CommonGenes <- intersect(i, k)  # gene names present in both files
    # Keep only the common genes; for a repeated gene, keep the entry with the highest value and drop the rest.
    tempi <- removeRepetitive(i, CommonGenes)
    tempk <- removeRepetitive(k, CommonGenes)
    correlationArray[i, k] <- spearman(tempi, tempk)  # correlate only the common genes
  }
}
Ultimately, I want to plot the correlation matrix using corrplot or qtlcharts.
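
For reference, here is a minimal base-R sketch of what I have in mind; it is not a tested solution. The function names read_max_per_gene and pairwise_spearman are hypothetical helpers I made up, and I am assuming the files are whitespace- or tab-separated with no header, and that at least two common genes are needed for a correlation to make sense. Because taking the maximum per gene does not depend on which genes are shared, the sketch removes duplicates once per file and intersects afterwards.

```r
# Read one two-column file (gene name, value; no header) and collapse repeated
# genes to their maximum value. Returns a numeric vector named by gene.
read_max_per_gene <- function(path) {
  d <- read.table(path, header = FALSE, col.names = c("gene", "value"),
                  colClasses = c("character", "numeric"))
  vapply(split(d$value, d$gene), max, numeric(1))
}

# Spearman correlation for every pair of experiments, using only the genes
# present in both members of the pair. `experiments` is a named list of the
# vectors produced by read_max_per_gene().
pairwise_spearman <- function(experiments) {
  n <- length(experiments)
  m <- matrix(NA_real_, n, n,
              dimnames = list(names(experiments), names(experiments)))
  for (i in seq_len(n)) {
    for (k in seq_len(n)) {
      common <- intersect(names(experiments[[i]]), names(experiments[[k]]))
      # Assumption: with fewer than two shared genes the entry stays NA.
      if (length(common) >= 2) {
        m[i, k] <- cor(experiments[[i]][common], experiments[[k]][common],
                       method = "spearman")
      }
    }
  }
  m
}

# Intended usage ("directory" is a placeholder path):
# files <- list.files("directory", full.names = TRUE)
# experiments <- lapply(files, read_max_per_gene)
# names(experiments) <- basename(files)
# cor_mat <- pairwise_spearman(experiments)
```

The resulting cor_mat is an ordinary square matrix with the file names as row and column names, so it should be possible to pass it on to corrplot::corrplot(cor_mat) for plotting.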