1

I have two .csv files containing correlation matrices exported from R. One file contains the P-values and one contains the r-values. The row and column headers match exactly between the two files.

I am trying to extract the r-values and corresponding row and column header for pairs only when the P-value < 0.05. Here is a sample of what the data in the r-value input file looks like (I have 1700+ correlated items, rather than only the two shown):

            Species1                 Species2
Species1      1                       0.9
Species2      0.9                     1

The P-value input file is identical, except containing P-values in place of r-values.

I am relatively new to Python, and am not sure how to handle files of this type. I have tried a few strategies, including using the csv library to iterate through the files. I looked into using numpy, but it doesn't seem that it will work for me (?). I also looked into using scipy to calculate r- and P-values (Pearsons) in Python, but it seems that this only works for comparing two one dimensional arrays (I have 1700+ columns of data to correlate).

Code I am starting with, to show you what I have imported:

import csv
infileP = open('AllcorrP.csv', 'rU')
infileR = open('AllcorrR.csv', 'rU')

The question Can anyone help me extract the column and row headers and r-values from my r-value file based on significant (< 0.05) P-values from my p-value file?

OR

Calculate the r- and P-values for all possible correlations between many columns of data directly using Python and extract only the results with significant P-values?

In the end, I would like output in two files.
First file:

Species1   Species2   Species4  ...
Species2   Species1   Species7  ...

etc...(where "Species1" is the first species with significant correlations and the next items on the line are the species that it significantly correlated with (Species2, Species4 etc.)

Second file:

Species1 (corr) Species2 = 0.87
Species2 (corr) Species7 = 0.72
...

etc. which shows each pairwise correlation and the r-value that goes with it

At this point, I'd be happy to just be able to extract a list of the r-values and species that I want and figure out the final two file formatting later. Thank you!

python4ecology
  • 275
  • 1
  • 3
  • 7

2 Answers2

1

To read the data, you should be able to use numpy.genfromtext. See the documentation, there is a ton of functionality within this function. To read your example above, you might do:

from numpy import genfromtxt
rdata = genfromtxt('AllcorrR.csv', skip_header=1)[:,1:]
Pdata = genfromtxt('AllcorrP.csv', skip_header=1)[:,1:]

The [:,1:] is to ignore the first column of data when read in. The function doesn't have an input to "ignore the first x columns" like it does for rows (via skip_header). Not sure why they didn't implement this, it always bugged me.

This would just read the data for P (can also do this for r). Then you can filter the data pretty easily. You could read in the first row and column separated to get the headings. Or if you see the genfromtxt documentation, you could also name them (create a recarray).

To find the indices (values) where r is less then 0.50, you can simply do a comparison and numpy automagically creates a boolean array for you:

print Pdata < 0.05

This can be used as an index into rdata (make sure there are the same number of rows/columns):

print rdata[Pdata < 0.05]
Scott B
  • 2,542
  • 7
  • 30
  • 44
0

You can do something like this to get a list of tuples, containing the row and column headers, and the r and P values of the data elements you're interested in:

infile_r = open('AllcorrR.csv', 'r')
infile_p = open('AllcorrP.csv', 'r')

# Read the first line of each file.
line_r = infile_r.readline()
line_p = infile_p.readline()

# Set the separator depending on the file format.
SEPARATOR = None  # Elements separated by whitespace.
column_headers = line_r.split(SEPARATOR)

significant = []

# Read the rest of the lines.
for line_r in infile_r:
    line_p = infile_p.readline()
    tokens_r = line_r.split(SEPARATOR)
    tokens_p = line_p.split(SEPARATOR)
    row_header = tokens_r[0]
    values_r = [float(v) for v in tokens_r[1:]]
    values_p = [float(v) for v in tokens_p[1:]]
    significant.extend([(row_header, column_header, r, p) for column_header, r, p in zip(column_headers, values_r, values_p) if p < 0.05])

print significant
Dan Gerhardsson
  • 1,869
  • 13
  • 12