
I have a pandas DataFrame and I'm trying to find the covariance of the percentage change of each column. For each pair of columns, I want rows with missing values to be dropped first, and the percentage change to be calculated afterwards. That is, I want something like this:

import pandas as pd
import numpy as np

# example DataFrame: 249 rows x 3535 columns, with NaNs scattered in the first 10 columns
N_ROWS, N_COLS = 249, 3535
df = pd.DataFrame(np.random.random((N_ROWS, N_COLS)))
df.iloc[np.random.choice(N_ROWS, N_COLS), np.random.choice(10, 50)] = np.nan

cov_df = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        # drop rows with a NaN in either column, then take percentage changes
        cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
        cov_df.loc[col_i, col_j] = cov.iloc[0, 1]

The thing is, this is super slow. The code below gives me results that are similar to (but not exactly) what I want, and it runs quite fast:

df.dropna(how='any', axis=0).pct_change().cov()

I am not sure why the second one runs so much faster. I want to speed up the first version, but I can't figure out how.
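
To illustrate why the two results differ (a toy example, not my real data): if only one column has a NaN in some row, the global `dropna` removes that row for every pair, which changes the percentage changes even for pairs with no missing values.

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [2.0, 1.0, 4.0, 3.0],
                    'c': [1.0, np.nan, 2.0, 5.0]})

# per-pair version: the (a, b) pair keeps all four rows
pair_cov = toy[['a', 'b']].dropna(how='any', axis=0).pct_change().cov().iloc[0, 1]

# global version: row 1 is dropped because of the NaN in c,
# which changes the percentage changes computed for (a, b)
glob_cov = toy.dropna(how='any', axis=0).pct_change().cov().iloc[0, 1]

print(pair_cov, glob_cov)  # the two covariances differ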

I have tried using `combinations` from `itertools` to avoid repeating the calculation for `(col_i, col_j)` and `(col_j, col_i)`, and using `map` from `multiprocessing` to do the computations in parallel, but it still hadn't finished running after 90+ minutes.
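
For reference, the `combinations`-based attempt looked roughly like this (a sketch, so the exact code I ran may have differed slightly):

from itertools import combinations

cov_df = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
for col_i, col_j in combinations(df.columns, 2):
    cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
    # covariance is symmetric, so fill both (i, j) and (j, i)
    cov_df.loc[col_i, col_j] = cov.iloc[0, 1]
    cov_df.loc[col_j, col_i] = cov.iloc[0, 1]
# the diagonal is just the variance of each column's percentage change
for col in df.columns:
    cov_df.loc[col, col] = df[col].dropna().pct_change().var()

This halves the number of pairs, but with 3535 columns that is still over six million calls into pandas.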

Sara
  • It's too late for me to read through properly. Does this help: https://stackoverflow.com/q/52371329/2336654 – piRSquared Apr 08 '22 at 06:10
  • Also this: https://stackoverflow.com/a/38195754/2336654 – piRSquared Apr 08 '22 at 06:15
  • Thanks, looking at them now and trying to see how they can help. Probably won't update before Monday though. – Sara Apr 08 '22 at 06:18
  • @piRSquared I finished reading the second link and I think it's a bit different. The problem in the second link seems to be a memory issue: the data was too big, so `df.corr()` was slow. On the other hand, my DataFrame isn't that big, and `df.corr()` by itself runs fast enough, but the result isn't what I want. Still checking out the first link. – Sara Apr 11 '22 at 01:22
  • @piRSquared The first link seems to help. It also involves doing it in a loop. I haven't tested it out yet. It uses `numba`, but I wonder if there is a way to do this without numba, since `df.corr()` itself is quite fast, and I feel like my calculation isn't doing anything that much different from it. – Sara Apr 11 '22 at 01:30

1 Answer


Somehow this runs fast enough, although I am not sure why (presumably working directly on NumPy arrays avoids the per-pair overhead of constructing pandas objects):

import numpy as np
from scipy.stats import pearsonr

x = df.to_numpy()  # the DataFrame as a plain NumPy array

corr = np.zeros((x.shape[1], x.shape[1]))
for i in range(x.shape[1]):
    for j in range(i + 1, x.shape[1]):
        y = x[:, [i, j]]
        # drop rows where either column is NaN
        y = y[~np.isnan(y).any(axis=1)]
        # percentage change: (y[t] - y[t-1]) / y[t-1]
        y = np.diff(y, axis=0) / y[:-1, :]
        # pearsonr needs at least two observations
        if len(y) < 2:
            corr[i, j] = np.nan
            continue
        corr[i, j] = pearsonr(y[:, 0], y[:, 1])[0]
# mirror the upper triangle and set the diagonal
corr = corr + corr.T
np.fill_diagonal(corr, 1)

This finishes within 8 minutes, which is fast enough for my use case.
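
If you need the labels back, the NumPy result can be wrapped into a DataFrame. And since my original question asked for covariance rather than correlation, I believe replacing the `pearsonr(y[:, 0], y[:, 1])[0]` expression with `np.cov(y[:, 0], y[:, 1])[0, 1]` would give covariances instead (in that case the diagonal should be each column's variance rather than 1).

corr_df = pd.DataFrame(corr, index=df.columns, columns=df.columns)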

On the other hand, the version below (essentially the loop from my question) had been running for 30 minutes and still wasn't done:

corr = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        corr_ij = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().corr().iloc[0, 1]
        corr.loc[col_i, col_j] = corr_ij

I don't know why that is, but in any case the first one is a good enough solution for me for now.

Sara