I have a pandas data frame and I'm trying to find the covariance of the percentage change of each column. For each pair, I want rows with missing values to be dropped, and the percentage be calculated afterwards. That is, I want something like this:
import pandas as pd
import numpy as np
# create dataframe example
N_ROWS, N_COLS = 249, 3535
df = pd.DataFrame(np.random.random((N_ROWS, N_COLS)))
df.iloc[np.random.choice(N_ROWS, N_COLS), np.random.choice(10, 50)] = np.nan
cov_df = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
for col_j in df:
cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
cov_df.loc[col_i, col_j] = cov.iloc[0, 1]
The thing is this is super slow. The code below gives me results that is similar (but not exactly) what I want, but it runs quite fast
df.dropna(how='any', axis=0).pct_change().cov()
I am not sure why the second one runs so much faster. I want to speed up my code in the first, but I can't figure out how.
I have tried using combinations
from itertools
to avoid repeating the calculation for (col_i, col_j)
and (col_j, col_i)
, and using map
from multiprocessing
to do the computations in parallel, but it still hasn't finished running after 90+ mintues.