
Say I have a data frame of dimension (74, 3234): 74 rows and 3234 columns. I have a function that runs a correlation analysis. However, when I pass this data frame in as-is, it takes forever to print the results. I would now like to split the data frame into multiple chunks and use the chunks in the function.

The data frame has 20,000 columns whose names contain the string _PC and 15,000 columns whose names contain the string _lncRNAs.

The condition that needs to hold is: I need to split the data frame into multiple smaller data frames, each of which contains both columns with _PC and columns with _lncRNAs in their names. For example, df1 must contain 500 columns with _PC and 500 columns with _lncRNAs in their names.

I envision having multiple data frames, each with 74 rows but using consecutive columns: for instance, columns 1-500, 501-1000, 1001-1500, 1501-2000, and so on until the last column.

df1.shape
(74, 500)
df2.shape
(74, 500)

... so on

One example:

df1.head()

sam    END_PC  END2_PC  END3_lncRNAs  END4_lncRNAs
SAP1     50.9     30.4          49.0          50.0
SAP2      6.0      8.9          12.4          39.8

Then, I need to run each split data frame through the following function.

from scipy.stats import pearsonr
import pandas as pd

def correlation_analysis(lncRNA_PC_T):
    """
    Pearson correlation of every _PC column against every _lncRNAs column.
    """
    pc_cols  = [column for column in lncRNA_PC_T.columns if '_PC' in column]
    lnc_cols = [column for column in lncRNA_PC_T.columns if '_lncRNAs' in column]
    rows = []
    for PC in pc_cols:
        for lncRNA in lnc_cols:
            rows.append(pd.Series(pearsonr(lncRNA_PC_T[PC], lncRNA_PC_T[lncRNA]),
                                  index=['PCC', 'p-value'], name=PC + '_' + lncRNA))
    # DataFrame.append is deprecated/removed; collect the rows and build once
    correlations = pd.DataFrame(rows)
    correlations.reset_index(inplace=True)
    correlations['PC']      = correlations['index'].apply(lambda x: x.split('PC')[0])
    correlations['lncRNAs'] = correlations['index'].apply(lambda x: x.split('PC')[1])
    correlations['lncRNAs'] = correlations['lncRNAs'].apply(lambda x: x.split('_')[1])
    correlations['PC']      = correlations.PC.str.strip('_')
    correlations.drop('index', axis=1, inplace=True)
    correlations = correlations.reindex(columns=['PC', 'lncRNAs', 'PCC', 'p-value'])

    return correlations

For each data frame, the output should look like this:

                 gene        PCC   p-value
  END_PC_END3_lncRNAs  -0.042027  0.722192
 END2_PC_END3_lncRNAs  -0.017090  0.885088
  END_PC_END4_lncRNAs   0.001417  0.990441
 END2_PC_END4_lncRNAs  -0.041592  0.724954

I know one can split based on rows like this:

n = 200000  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

I want something like this, but based on columns. Any suggestions or help are much appreciated. Thanks.

ARJ
  • Transpose the dataframe `df.T`. And then apply your analysis on a chunk of rows in the transposed dataframe. – CypherX Jul 16 '20 at 18:27
  • Yes, I tried that. It will not fulfil the condition mentioned in the question. For example, the first 20,000 rows have the string `_PC` and the following rows have the string `_lncRNAs`. – ARJ Jul 16 '20 at 18:32
  • You might consider using the Modin/VAEX/Dask library on top of pandas. – CypherX Jul 16 '20 at 18:38
  • What are you trying to look for correlation among? The rows or the columns? If it's the rows, the output will be (74x74) and if it's the columns, it will be of shape (3234x3234) ~ 9x10^6 ~ 10 million. – CypherX Jul 16 '20 at 18:51
  • For the transposed data frame, I am looking for correlation between columns for each row. I have updated the question with the function for correlation analysis. – ARJ Jul 16 '20 at 18:55
  • @user1017373, IIUC, there are more `_PC` columns than `_lncRNAs` columns; what happens when the `_lncRNAs` columns are all used up? Does it keep creating chunks with the `_PC` columns that were left over? – MrNobody33 Jul 16 '20 at 18:55
  • Yes, that is another issue. If I go with the whole data frame (the entire input at once), the function takes forever to give me the output, so I thought splitting would help. I'm not sure which would be the better approach here in terms of computation and speed, while at the same time not losing data – ARJ Jul 16 '20 at 18:59
  • I am not clear what you meant here: *For the transposed data frame, I am looking for correlation between columns for each row.* Do you mean the correlation between the columns of the transposed dataframe? Or, between the columns of the original (non-transposed) dataframe? – CypherX Jul 16 '20 at 19:04
  • Try running `df.corr()` and see how long that takes. For a dataframe of shape (74, 3000) it only took around 3-4 seconds. – CypherX Jul 16 '20 at 19:07
  • Do you mean the correlation between the columns of the transposed data frame? Yes, the correlation between the column of the transposed data frame. – ARJ Jul 16 '20 at 19:09
  • Did you try with `df.T.corr()`? – MrNobody33 Jul 16 '20 at 19:10
  • Yes, for a data frame of `(35, 3234)` [split one], with my above function it took me 15 minutes – ARJ Jul 16 '20 at 19:11
  • @CypherX, Do you mind sharing the exact code you tried for finishing the job in 3-4 seconds? – ARJ Jul 16 '20 at 19:19

4 Answers


How about `df.iloc`?

And use `df.shape[1]` for the number of columns:

list_df = [df.iloc[:, i:i+n] for i in range(0, df.shape[1], n)]

Ref: How to take column-slices of dataframe in pandas
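
A minimal runnable sketch of this column-wise chunking (the frame shape and chunk size here are toy values for illustration, not the asker's data):

```python
import numpy as np
import pandas as pd

# Toy frame: 74 rows, 12 columns
df = pd.DataFrame(np.random.rand(74, 12),
                  columns=[f'C{i}' for i in range(12)])

n = 5  # chunk column size
list_df = [df.iloc[:, i:i+n] for i in range(0, df.shape[1], n)]

# Three chunks: (74, 5), (74, 5), and a final partial chunk (74, 2)
print([chunk.shape for chunk in list_df])
```

Concatenating the chunks back along `axis=1` reproduces the original frame, so no columns are lost at the boundaries.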

Basil
  • Thanks for the reply! The question is updated now; it has one more additional part. Sorry for that – ARJ Jul 16 '20 at 18:20

It's just like Basil wrote, but using `pandas.DataFrame.iloc`.

I do not know what the column labels are, so to make this independent of the index or column labels, it is better to use:

list_df = [df.iloc[:,i:i+n] for i in range(0, df.shape[1], n)]

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

Flor
  • I have updated the question. Now the column split should be based on the mentioned condition. – ARJ Jul 16 '20 at 18:25

This is what I tried to see how long it takes to evaluate the correlation between the rows and the columns of the dataframe (df). It took under 50 milliseconds for rows-correlation and under 2 seconds for columns-correlation.

  • Rows-correlation output-shape: (74x74)
  • Columns-correlation output-shape: (3000x3000)

Correlation b/w some target-columns and all-columns

# Full column-correlation matrix (df is defined in the "Dummy Data" section below)
df_corr = df.corr()

# Define target columns
target_cols = ['C0', 'C12', 'C100']

# Extract correlation-result for target-columns
corr_result = df_corr[target_cols]
print(corr_result)

Output:

             C0       C12      C100
C0     1.000000 -0.031120 -0.221829
C1     0.064772 -0.130507 -0.086164
C2     0.077853 -0.116949  0.003468
C3     0.070557 -0.013551  0.007093
C4     0.165782 -0.058755 -0.175888
...         ...       ...       ...
C2995 -0.097033 -0.014391  0.018961
C2996  0.099591  0.017187 -0.016138
C2997 -0.126288  0.145150 -0.089306
C2998  0.033484  0.054106 -0.006594
C2999 -0.154657  0.020002 -0.104889

Dummy Data

import numpy as np
import pandas as pd

## Create Dummy Data
a = np.random.rand(74, 3000)
print(f'a.shape: {a.shape}')

## Create Dataframe
index = [f'R{i}' for i in range(a.shape[0])]
columns = [f'C{i}' for i in range(a.shape[1])]
df = pd.DataFrame(a, columns=columns, index=index)
df.shape # (74, 3000)

Evaluate Correlation

I did the following in a jupyter notebook (note that `%%time` must be the first line of its cell):

%%time
## Correlation between Rows of df
df.T.corr()
#CPU times: user 39.5 ms, sys: 1.09 ms, total: 40.6 ms
#Wall time: 41.3 ms

%%time
## Correlation between Columns of df
df.corr()
# CPU times: user 1.64 s, sys: 34.6 ms, total: 1.67 s
# Wall time: 1.67 s

Output: df.corr()

Since the shape of the dataframe was (74, 3000), df.corr() yields a dataframe of shape (3000, 3000).

             C0        C1        C2  ...     C2997     C2998     C2999
C0     1.000000  0.064772  0.077853  ... -0.126288  0.033484 -0.154657
C1     0.064772  1.000000  0.031059  ...  0.064317  0.095075 -0.100423
C2     0.077853  0.031059  1.000000  ... -0.123791 -0.034085  0.052334
C3     0.070557  0.229482  0.047476  ...  0.043630 -0.055772  0.037123
C4     0.165782  0.189635 -0.009193  ... -0.123917  0.097660  0.074777
...         ...       ...       ...  ...       ...       ...       ...
C2995 -0.097033 -0.126214  0.051592  ...  0.008921 -0.004141  0.221091
C2996  0.099591  0.030975 -0.081584  ...  0.186931  0.084529  0.063596
C2997 -0.126288  0.064317 -0.123791  ...  1.000000  0.061555  0.024695
C2998  0.033484  0.095075 -0.034085  ...  0.061555  1.000000  0.195013
C2999 -0.154657 -0.100423  0.052334  ...  0.024695  0.195013  1.000000
CypherX
  • Thanks for sharing. However, the output I am looking for is the correlation between each column with `_PC` versus the columns with `_lncRNAs` string. As I have mentioned in my primary question – ARJ Jul 16 '20 at 19:36
  • @user1017373 Perhaps this will help. Here it is, as you asked. – CypherX Jul 16 '20 at 19:36
  • Yes, What I need is to split the dataframe and then try the correlation. Because my entire data frame is really huge – ARJ Jul 16 '20 at 19:39
  • What you want, as I understand is the correlation between all the columns and a subset of the columns. Right? If this is the case, the result you get from `df.corr()`, you could easily choose the subset of it to fit your requirement. – CypherX Jul 16 '20 at 19:40
  • The results from your script generate the correlation score between rows, as I understand it. But what I need is the correlation score between columns, meaning between the columns with the `_PC` string and those with the `_lncRNAs` string in their names. Anyway, I appreciate your help. I am not able to adapt your code to my requirement. – ARJ Jul 16 '20 at 19:50
  • The output I am looking for should look like this at the end (for the given input in my question), like I said in my question above – ARJ Jul 16 '20 at 19:52
  • I have bolded the output section in my question above. – ARJ Jul 16 '20 at 19:54
  • You keep talking about `_PC` and `_lncRNAs`: but they are essentially columns. See how I calculated the correlation between (a subset of the columns) and (all-columns). – CypherX Jul 16 '20 at 19:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/217994/discussion-between-cypherx-and-user1017373). – CypherX Jul 16 '20 at 19:57

If you want the correlation between each column with _PC versus the columns with _lncRNAs string, you could try something like this:

df_pc=df.filter(like='_PC')
df_lncRNAs=df.filter(like='_lncRNAs')
pd.concat([df_pc, df_lncRNAs], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']

Example:

import pandas as pd

df = pd.DataFrame({"a_pc": [1, 2, 3, 4, 5, 6],
                   "b_pc": [3, 210, 12, 412, 512, 61],
                   "c_pc": [1, 2, 3, 4, 5, 6],
                   "d_lncRNAs": [3, 210, 12, 412, 512, 61],
                   "d1_lncRNAs": [3, 210, 12, 412, 512, 61]})

df_pc=df.filter(like='_pc')
df_lncRNAs=df.filter(like='_lncRNAs')
correlation=pd.concat([df_pc, df_lncRNAs], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']
correlation

Output:

df
   a_pc  b_pc  c_pc  d_lncRNAs  d1_lncRNAs
0     1     3     1          3           3
1     2   210     2        210         210
2     3    12     3         12          12
3     4   412     4        412         412
4     5   512     5        512         512
5     6    61     6         61          61

df_pc
   a_pc  b_pc  c_pc
0     1     3     1
1     2   210     2
2     3    12     3
3     4   412     4
4     5   512     5
5     6    61     6

df_lncRNAs 
   d_lncRNAs  d1_lncRNAs
0          3           3
1        210         210
2         12          12
3        412         412
4        512         512
5         61          61

correlation
                a_pc  b_pc      c_pc
d_lncRNAs   0.392799   1.0  0.392799
d1_lncRNAs  0.392799   1.0  0.392799
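
If the full frame is too large for a single corr() call, the same filter/concat idea can be applied chunk-wise over the `_pc` columns and the pieces stitched back together. This is only a sketch, not a tested recipe for the asker's data: the helper name and chunk size are made up, and it follows the lowercase `_pc` names used in the example above.

```python
import pandas as pd

def chunked_pc_lnc_corr(df, n=500):
    """Correlate each chunk of n '_pc' columns against all '_lncRNAs'
    columns, then concatenate the per-chunk results column-wise."""
    pc_cols = [c for c in df.columns if '_pc' in c]
    lnc = df.filter(like='_lncRNAs')
    pieces = []
    for i in range(0, len(pc_cols), n):
        pc_chunk = df[pc_cols[i:i + n]]
        both = pd.concat([pc_chunk, lnc], axis=1, keys=['df1', 'df2'])
        # rows: _lncRNAs columns, columns: this chunk's _pc columns
        pieces.append(both.corr().loc['df2', 'df1'])
    return pd.concat(pieces, axis=1)
```

With the example df above and n=2, this should give the same (2, 3) correlation table as the single-shot version, since each `_pc` column is still correlated against every `_lncRNAs` column.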
MrNobody33
  • Thanks for the solution. The script I posted in my primary question also calculates correlations. But the problem is that my input data frames are huge (e.g. dimension: 3246, 307869), so it takes forever. Hence, I need a solution where the input can be conditionally split and sent to the correlation function. – ARJ Jul 16 '20 at 20:14