Say I have a data frame of shape (74, 3234): 74 rows and 3234 columns. I have a function that runs a correlation analysis. However, when I pass this data frame as it is, the function takes forever to print the results. I would therefore like to split the data frame into multiple chunks and use the chunks in the function.
The data frame has 20,000 columns whose names contain the string `_PC` and 15,000 columns whose names contain the string `_lncRNAs`.
The condition that needs to hold is this: I need to split the data frame into multiple smaller data frames, each of which contains both `_PC` and `_lncRNAs` columns. For example, `df1` must contain 500 columns with the `_PC` string and 500 columns with the `_lncRNAs` string.
I envision having multiple data frames, each with all 74 rows but taking consecutive columns, for instance 1-500, 501-1000, 1001-1500, 1501-2000, and so on until the last column:

    df1.shape
    (74, 500)
    df2.shape
    (74, 500)
    ... and so on

One example:

    df1.head()
    sam END_PC END2_PC END3_lncRNAs END4_lncRNAs
    SAP1 50.9 30.4 49.0 50
    SAP2 6 8.9 12.4 39.8 345.9888

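To make the requirement concrete, the kind of split described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `split_paired_columns` and its parameter `per_group` are hypothetical, the toy frame uses made-up column names and random values, and columns are classified purely by the substrings `_PC` and `_lncRNAs`.

```python
import numpy as np
import pandas as pd


def split_paired_columns(df, per_group=500):
    """Yield sub-frames with up to `per_group` _PC and `per_group` _lncRNAs columns.

    Hypothetical helper: the name and the parameter are illustrative only.
    """
    pc_cols = [c for c in df.columns if "_PC" in c]
    lnc_cols = [c for c in df.columns if "_lncRNAs" in c]
    # Ceiling division: number of chunks needed to cover the larger group.
    n_chunks = max(-(-len(pc_cols) // per_group), -(-len(lnc_cols) // per_group))
    for k in range(n_chunks):
        start, stop = k * per_group, (k + 1) * per_group
        # Once the smaller group is exhausted, a chunk holds only the other group.
        yield df[pc_cols[start:stop] + lnc_cols[start:stop]]


# Toy data: 5 _PC columns and 3 _lncRNAs columns, 4 rows of random values.
rng = np.random.default_rng(0)
cols = [f"G{i}_PC" for i in range(5)] + [f"G{i}_lncRNAs" for i in range(3)]
toy = pd.DataFrame(rng.normal(size=(4, len(cols))), columns=cols)

chunks = list(split_paired_columns(toy, per_group=2))
for chunk in chunks:
    print(chunk.shape, list(chunk.columns))
```

With unequal group sizes (such as 20,000 `_PC` and 15,000 `_lncRNAs` columns), the last chunks would contain `_PC` columns only. Also, a split like this pairs each `_PC` column only with the `_lncRNAs` columns in the same chunk, so it covers fewer pairs than running on the full frame.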
Then I need to run the following function on each split data frame:

    import pandas as pd
    from scipy.stats import pearsonr


    def correlation_analysis(lncRNA_PC_T):
        """
        Function for correlation analysis
        """
        pc_columns = [column for column in lncRNA_PC_T.columns if '_PC' in column]
        lncRNA_columns = [column for column in lncRNA_PC_T.columns if '_lncRNAs' in column]
        rows = []
        for PC in pc_columns:
            for lncRNA in lncRNA_columns:
                # pearsonr returns the correlation coefficient and the p-value
                pcc, p_value = pearsonr(lncRNA_PC_T[PC], lncRNA_PC_T[lncRNA])
                rows.append(pd.Series([pcc, p_value], index=['PCC', 'p-value'],
                                      name=PC + '_' + lncRNA))
        # DataFrame.append was removed in pandas 2.0, so build the frame once from the list
        correlations = pd.DataFrame(rows)
        correlations.reset_index(inplace=True)
        correlations.rename(columns={0: 'name'}, inplace=True)
        # Split the combined name back into its PC part and its lncRNA part
        correlations['PC'] = correlations['index'].apply(lambda x: x.split('PC')[0])
        correlations['lncRNAs'] = correlations['index'].apply(lambda x: x.split('PC')[1])
        correlations['lncRNAs'] = correlations['lncRNAs'].apply(lambda x: x.split('_')[1])
        correlations['PC'] = correlations.PC.str.strip('_')
        correlations.drop('index', axis=1, inplace=True)
        correlations = correlations.reindex(columns=['PC', 'lncRNAs', 'PCC', 'p-value'])
        return correlations

For each data frame, the output should look like this:

    gene                    PCC         p-value
    END_PC_END3_lncRNAs     -0.042027   0.722192
    END2_PC_END3_lncRNAs    -0.017090   0.885088
    END_PC_END4_lncRNAs      0.001417   0.990441
    END2_PC_END3_lncRNAs    -0.041592   0.724954

I know one can split based on rows like this:

    n = 200000  # chunk row size
    list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

I want something like this, but based on columns. Any suggestions or help would be much appreciated. Thanks.
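For comparison, the direct column-wise analogue of the row split above would presumably slice with `iloc` on the column axis. The following is a minimal sketch, assuming purely positional chunks are acceptable; the toy frame, its column names, and the chunk size of 5 are placeholders rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: 74 rows, 12 columns of placeholder values.
df = pd.DataFrame(
    np.arange(74 * 12).reshape(74, 12),
    columns=[f"c{i}" for i in range(12)],
)

n = 5  # chunk column size
# Same pattern as the row split, but slicing columns by position.
list_df = [df.iloc[:, i:i + n] for i in range(0, df.shape[1], n)]

for chunk in list_df:
    print(chunk.shape)
```

Positional slicing like this keeps all 74 rows in every chunk, but it does not by itself guarantee that each chunk contains both `_PC` and `_lncRNAs` columns; that depends on how the columns are ordered.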