13

I am working with a large biological dataset.

I want to calculate the PCC (Pearson's correlation coefficient) for every two-column combination in my data table and save the result as a DataFrame or CSV file.

The data table looks like the one below: columns are gene names and rows are dataset codes. Each float value indicates how strongly the gene is activated in that dataset.

      GeneA GeneB GeneC ...
DataA 1.5 2.5 3.5 ...
DataB 5.5 6.5 7.5 ...
DataC 8.5 8.5 8.5 ...
...

As output, I want to build a table (DataFrame or CSV file) like the one below, with two value columns because the scipy.stats.pearsonr function returns (PCC, p-value). In my example, XX and YY are the results of pearsonr([1.5, 5.5, 8.5], [2.5, 6.5, 8.5]). In the same way, ZZ and AA are the results of pearsonr([1.5, 5.5, 8.5], [3.5, 7.5, 8.5]). I do not need redundant pairs such as GeneB_GeneA or GeneC_GeneB in my result.

               PCC P-value
GeneA_GeneB    XX YY
GeneA_GeneC    ZZ AA
GeneB_GeneC    BB CC
...

Because there are many columns and rows (over 100) and their names are complicated, referring to individual columns or rows by name will be difficult.

This might be a simple problem for experts, but I do not know how to handle this kind of table with Python and the pandas library. In particular, creating a new DataFrame and adding the results to it seems very difficult.

Sorry for my poor explanation, but I hope someone can help me.

Stefan
z991
  • This is answered here: [link](http://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python) – Glostas Nov 30 '15 at 11:55
  • Thank you for your comment. I think the title was not good enough. What I want to know is not how to calculate the PCC, but how to calculate the PCC for every pair of columns and save the results as a new DataFrame. – z991 Nov 30 '15 at 12:01

4 Answers

21
import itertools

import numpy as np
from pandas import DataFrame
from scipy.stats import pearsonr

Creating random sample data:

df = DataFrame(np.random.random((5, 5)), columns=['gene_' + chr(i + ord('a')) for i in range(5)]) 
print(df)

     gene_a    gene_b    gene_c    gene_d    gene_e
0  0.471257  0.854139  0.781204  0.678567  0.697993
1  0.292909  0.046159  0.250902  0.064004  0.307537
2  0.422265  0.646988  0.084983  0.822375  0.713397
3  0.113963  0.016122  0.227566  0.206324  0.792048
4  0.357331  0.980479  0.157124  0.560889  0.973161

correlations = {}
columns = df.columns.tolist()

for col_a, col_b in itertools.combinations(columns, 2):
    correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])

result = DataFrame.from_dict(correlations, orient='index')
result.columns = ['PCC', 'p-value']

print(result.sort_index())

                     PCC   p-value
gene_a__gene_b  0.461357  0.434142
gene_a__gene_c  0.177936  0.774646
gene_a__gene_d -0.854884  0.064896
gene_a__gene_e -0.155440  0.802887
gene_b__gene_c -0.575056  0.310455
gene_b__gene_d -0.097054  0.876621
gene_b__gene_e  0.061175  0.922159
gene_c__gene_d -0.633302  0.251381
gene_c__gene_e -0.771120  0.126836
gene_d__gene_e  0.531805  0.356315
  • Get the unique combinations of DataFrame columns using itertools.combinations(iterable, r)
  • Iterate through these combinations and calculate the pairwise correlations using scipy.stats.pearsonr
  • Add each result (the PCC and p-value tuple) to a dictionary
  • Build a DataFrame from the dictionary

You could then save the result with result.to_csv(). You might also find it convenient to use a MultiIndex (two index levels holding the names of the two columns in each pair) instead of the concatenated names created above for the pairwise correlations.
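The MultiIndex suggestion could look roughly like the sketch below. The sample data, the level names gene_1 and gene_2, and the use of tuple keys are illustrative assumptions, not part of the original answer.

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical sample data standing in for the real gene table.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 4)),
                  columns=["gene_a", "gene_b", "gene_c", "gene_d"])

# Key each result by a (first column, second column) tuple
# instead of a joined string such as "gene_a__gene_b".
correlations = {}
for col_a, col_b in itertools.combinations(df.columns, 2):
    pcc, p_value = pearsonr(df[col_a], df[col_b])
    correlations[(col_a, col_b)] = (pcc, p_value)

# The level names "gene_1" and "gene_2" are arbitrary choices.
index = pd.MultiIndex.from_tuples(list(correlations), names=["gene_1", "gene_2"])
result = pd.DataFrame(list(correlations.values()),
                      index=index, columns=["PCC", "p-value"])

print(result)
# Select every pair whose first member is gene_a.
print(result.loc["gene_a"])
```

With this layout, result.to_csv() writes the two index levels as two separate columns, so the gene names do not have to be split apart again when the file is read back.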

Stefan
  • Thank you very much! As you and ChenZhongPu advised, using combination function seems to be a good solution for this kind of problem. Also I would like to thank you again for your kind explanations. It was very helpful because I am new at python. – z991 Nov 30 '15 at 15:35
  • sorry my data is quite large, so it's very very slow, and I get `MemoryError: Unable to allocate 15.8 GiB for an array with shape (46063, 46063) and data type float64`, any ideas to deal with this issue? – ah bon Aug 29 '21 at 16:13
  • The above solution computes correlations pair by pair precisely to avoid creating the full correlation matrix. – Stefan Aug 31 '21 at 10:35
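For very wide tables, even the dictionary of per-pair results can grow large. One possible variation, sketched here under the assumption that the results only need to end up in a CSV file, is to write each pair as soon as it is computed. The function name write_pairwise_pcc and the sample data are hypothetical.

```python
import csv
import io
import itertools

import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def write_pairwise_pcc(df, handle):
    """Write one CSV row per column pair without collecting all results first."""
    writer = csv.writer(handle)
    writer.writerow(["pair", "PCC", "p-value"])
    # Extract each column once as a NumPy array to avoid repeated DataFrame lookups.
    arrays = {col: df[col].to_numpy() for col in df.columns}
    for col_a, col_b in itertools.combinations(df.columns, 2):
        pcc, p_value = pearsonr(arrays[col_a], arrays[col_b])
        writer.writerow([f"{col_a}__{col_b}", pcc, p_value])


# Hypothetical sample data standing in for the real gene table.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 4)),
                  columns=["gene_a", "gene_b", "gene_c", "gene_d"])

# An in-memory buffer stands in for open("pcc.csv", "w", newline="").
buffer = io.StringIO()
write_pairwise_pcc(df, buffer)
print(buffer.getvalue())
```

This only bounds memory use. The number of pairs still grows quadratically with the number of columns, so the loop itself will remain slow for tens of thousands of columns.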
5

A simple solution is to use the pairwise_corr function of the Pingouin package (which I created):

import pingouin as pg
pg.pairwise_corr(data, method='pearson')

This will give you a DataFrame with all combinations of columns, and, for each of those, the r-value, p-value, sample size, and more.

There are also a number of options to specify one or more columns (e.g. one-vs-all behavior), as well as covariates for partial correlation and different methods to calculate the correlation coefficient. Please see this example Jupyter Notebook for a more in-depth demo.

Raphael
4

Assuming your data is in a pandas DataFrame,

df.corr('pearson')  # 'kendall' and 'spearman' are the other 2 options

will give you the full correlation matrix for every pair of columns.
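Note that df.corr returns only the coefficients, not p-values, and the matrix contains each pair twice plus the diagonal. A minimal sketch of reducing it to one row per unique pair, in the GeneA_GeneB naming used in the question, is shown below; the sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real gene table.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 4)),
                  columns=["GeneA", "GeneB", "GeneC", "GeneD"])

corr = df.corr(method="pearson")

# Positions in the strict upper triangle: k=1 skips the diagonal,
# so each unordered pair appears exactly once.
rows, cols = np.triu_indices(len(corr.columns), k=1)

pairs = pd.DataFrame(
    {"PCC": corr.to_numpy()[rows, cols]},
    index=[f"{corr.index[i]}_{corr.columns[j]}" for i, j in zip(rows, cols)],
)
print(pairs)
```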

Metehan
  • Sorry, I get a `MemoryError: Unable to allocate 15.8 GiB for an array with shape (46063, 46063) and data type float64`, any ideas to deal with this issue? Thanks. – ah bon Aug 29 '21 at 16:10
  • Do you really need to check the correlation between 46063 columns? If not then create a new dataframe with only the columns that you want to check the correlation. – Metehan Aug 30 '21 at 22:03
  • Yes, I would like to check if it's possible doing so. – ah bon Aug 31 '21 at 01:06
  • 1
    You can then try to convert the data types to float32 or something with lower space requirement than float64. – Metehan Sep 01 '21 at 04:58
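The 15.8 GiB figure in the error message follows directly from the matrix shape, as the arithmetic sketch below shows. The helper name corr_matrix_gib is hypothetical. Whether a particular correlation routine actually keeps float32 internally is a separate question that this sketch does not answer.

```python
import numpy as np


def corr_matrix_gib(n_columns, dtype):
    """Size in GiB of a dense n_columns x n_columns matrix of the given dtype."""
    return n_columns ** 2 * np.dtype(dtype).itemsize / 2 ** 30


print(round(corr_matrix_gib(46063, np.float64), 1))  # 15.8
print(round(corr_matrix_gib(46063, np.float32), 1))  # 7.9
```

Halving the element size halves the footprint of a dense matrix, but roughly 7.9 GiB is still a large single allocation.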
3

Getting the pairs is a combinations problem. You can then concatenate all the rows into one result DataFrame.

import pandas as pd
from itertools import combinations
from scipy.stats import pearsonr

# index_col=0 assumes the first CSV column holds the dataset codes (row labels)
df = pd.read_csv('gene.csv', index_col=0)
# get the column names as a list; these are the gene names
column_list = df.columns.tolist()
result = []
for first_gene, second_gene in combinations(column_list, 2):
    # compute the PCC and p-value for this pair with scipy
    pcc, p_value = pearsonr(df[first_gene], df[second_gene])
    # build a single-row DataFrame per pair, indexed by "GeneA_GeneB"
    result.append(pd.DataFrame({'PCC': [pcc], 'P-value': [p_value]},
                               index=[f'{first_gene}_{second_gene}']))

# stack the single-row frames into one result table
result_df = pd.concat(result)
#result_df.to_csv(...)
chenzhongpu
  • I did not know about 'combinations', but it looks nice for this kind of pairwise calculation. I also learned that a DataFrame can easily be built from a list with the concat function. Thank you very much! – z991 Nov 30 '15 at 15:32