
I have a data file of ~375 cell lines and ~14,000 genes, and I'm attempting to compute the pairwise correlation of each gene with every other gene.

The code is very simple, as I'm using the pingouin package:

import pandas as pd
import pingouin as pg
df = pd.read_csv("CCLE Proteomics.csv", index_col=0, header=0)
df_corr = df.rcorr(stars=False)
print(df_corr)

Attempting to run this code returns:

ValueError: x and y must have length at least 2.

Pingouin uses SciPy's pearsonr to do the calculations, and calling pearsonr directly (without Pingouin) returns the same error.

I've also tried a dummy dataset (a 5x7 dataframe of random numbers), which works fine when it contains no null values but returns the same error as soon as even one NaN is present. Based on this, I believe the null values in my dataset are causing the issue. Unfortunately, the data is spotty enough that removing ALL rows/columns containing a null value leaves me with no rows/columns at all. Since rcorr removes NaN values before feeding the data to pearsonr, I believe it's dropping all of my data points and leaving pearsonr with nothing to work on.
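
For what it's worth, here is a minimal reproduction of what I think is happening (the column names are made up): with staggered NaNs, listwise removal leaves fewer than two complete rows, and pearsonr raises exactly this error:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Staggered NaNs: no column is empty, but only one row is fully complete
df = pd.DataFrame({"gene_a": [1.0, np.nan, 3.0, np.nan, 5.0],
                   "gene_b": [np.nan, 2.0, np.nan, 4.0, 5.0]})

complete = df.dropna()    # listwise removal leaves a single row
print(len(complete))      # 1
pearsonr(complete["gene_a"], complete["gene_b"])
# ValueError: x and y must have length at least 2.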

df.corr() can calculate my r-values just fine, but I also need a method to calculate p-values for this dataset, as we expect a significant number of these correlations to be insignificant.

Is there a way to drop/mask NaN values within my dataset without dropping entire rows/columns? Is there a way to run pearsonr so that it behaves like spearmanr with nan_policy='omit'? Or am I off base, and it's not the NaN values that are the issue here?
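
To illustrate, this is roughly the behaviour I'm after. The sketch below (pairwise_pearson and the min_n threshold are my own invention, not a library function) drops NaNs per pair of columns rather than listwise, and collects both r and p. I realise a naive double loop like this will be very slow over ~14,000 genes; it's only meant to show the idea:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pairwise_pearson(df, min_n=3):
    """Pearson r and p for every column pair, using pairwise-complete rows."""
    cols = df.columns
    r = pd.DataFrame(np.nan, index=cols, columns=cols)
    p = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        r.loc[a, a] = 1.0                # a gene correlates perfectly with itself
        for b in cols[i + 1:]:
            pair = df[[a, b]].dropna()   # keep rows where BOTH genes are measured
            if len(pair) >= min_n:       # need enough overlap for a meaningful test
                r_ab, p_ab = pearsonr(pair[a], pair[b])
                r.loc[a, b] = r.loc[b, a] = r_ab
                p.loc[a, b] = p.loc[b, a] = p_ab
    return r, p

df.corr() already does the pairwise-complete part for the r-values (its min_periods argument controls the minimum overlap per pair); it's the matching matrix of p-values that I can't get.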

