Pandas apply function by columns

Question

I have a dataframe with dates (30/09/2022 to 31/11/2022) and 15 stock prices for each of these dates, excluding weekends. Only 5 of the stocks are shown below for reference.

Current Data:

   DATES   |  A  |  B  |  C  |  D  |  E  |
 30/09/22 |100.5|151.3|233.4|237.2|38.42|
 01/10/22 |101.5|148.0|237.6|232.2|38.54|
 02/10/22 |102.2|147.6|238.3|231.4|39.32|
 03/10/22 |103.4|145.7|239.2|232.2|39.54|

I wanted to get the Pearson correlation matrix, so I did this:

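import numpy as np
import pandas as pd

df = pd.read_excel(file_path, sheet_name)
df = df.dropna()  # Remove dates that do not have prices for all stocks
# Daily log returns: log(price_t / price_{t-1}); the first row has no previous day, so it is NaN
log_df = df.set_index("DATES").pipe(lambda d: np.log(d.div(d.shift()))).reset_index()
corrM = log_df.corr()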

Now I want to build the Pearson Uncentered Correlation Matrix, so I have the following function:

def uncentered_correlation(x, y):
    # Uncentered Pearson correlation: sum(x*y) / sqrt(sum(x^2) * sum(y^2))
    x_dim = len(x)

    xy = 0
    xx = 0
    yy = 0
    for i in range(x_dim):
        xy = xy + x[i] * y[i]
        xx = xx + x[i] ** 2.0
        yy = yy + y[i] ** 2.0

    corr = xy / np.sqrt(xx * yy)
    return corr

However, I do not know how to apply this function to each possible pair of columns of the dataframe to get the correlation matrix.

itsivyaaaaaaa · Accepted Answer · 2022-12-27T11:46:47.563

score 2

Try this. It is not very elegant, but it may work for you. :)

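from itertools import product

import pandas as pd

def iter_product(a, b):
    # All ordered pairs (a_i, b_j), including each column paired with itself
    return list(product(a, b))

# Placeholder: replace with your own DataFrame of numeric columns
df = 'your dataframe here'
re_dict = {}
iter_re = iter_product(df.columns, df.columns)
for i in iter_re:
    # i is a (column, column) tuple; store the coefficient under that pair
    result = uncentered_correlation(df[i[0]], df[i[1]])
    re_dict[i] = result
re_df = pd.DataFrame(re_dict, index=[0]).stack()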
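
As an alternative to looping over pairs, the whole matrix can be computed with one matrix product. This is only a sketch, not part of the original answer: the helper name uncentered_corr_matrix and the toy data are made up for illustration, and it assumes every column is numeric and not all zeros.

```python
import numpy as np
import pandas as pd

def uncentered_corr_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Uncentered Pearson correlation between every pair of columns."""
    clean = df.dropna()                   # rows with NaN would poison every sum
    values = clean.to_numpy(dtype=float)
    cross = values.T @ values             # cross[i, j] = sum_k x_ki * x_kj
    norms = np.sqrt(np.diag(cross))       # sqrt(sum_k x_ki ** 2) for each column
    return pd.DataFrame(cross / np.outer(norms, norms),
                        index=clean.columns, columns=clean.columns)

# Toy data standing in for the log-return columns (hypothetical values)
toy = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                    "B": [2.0, 4.0, 6.0],
                    "C": [1.0, -1.0, 0.5]})
matrix = uncentered_corr_matrix(toy)
```

The result is a square DataFrame indexed by column name on both axes, so matrix.loc["A", "B"] gives the coefficient for that pair. With real data, the date column would need to be moved into the index first so that only price columns remain.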

edited Dec 27 '22 at 11:46

answered Dec 27 '22 at 11:40


Hey! Thanks for your answer. iter_re generates all the possible pairs, so it's a start for me. But once I reach the for loop, all the calculated values are nan. Should I change either of the two values passed to the uncentered_correlation function? – Guillem Dec 27 '22 at 12:10
Ignore my last comment, that worked perfectly. I forgot that the first day's return is always nan, since there is no previous day to make the calculation with. And I do not care about the elegance; as long as it works, I'll be happy! – Guillem Dec 27 '22 at 12:30
@Guillem Perfect. Have fun with it. :) – itsivyaaaaaaa Dec 27 '22 at 13:27
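
The nan issue from the comments can be reproduced on a small example. The sketch below is an illustration only: it reuses the first two price columns from the question as toy data, and the variable names are made up. It shows that the first log return is NaN and that dropping it leaves clean rows to feed into the correlation function.

```python
import numpy as np
import pandas as pd

# Prices for two of the stocks shown in the question
prices = pd.DataFrame({"A": [100.5, 101.5, 102.2, 103.4],
                       "B": [151.3, 148.0, 147.6, 145.7]})

# log(price_t / price_{t-1}); row 0 has no previous row, so it is NaN
log_returns = np.log(prices.div(prices.shift()))
first_row_is_nan = bool(log_returns.iloc[0].isna().all())

# Drop the NaN row before summing, otherwise every sum becomes NaN
clean_returns = log_returns.dropna()
```

Note that dropna() keeps the original index labels, so clean_returns starts at label 1. A function that indexes with x[i] for i starting at 0 would need clean_returns.reset_index(drop=True) first.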

score 1 · Answer 2 · answered Dec 27 '22 at 11:38

First, compute the list of possible column combinations. You can use the itertools library for that.
Then use pandas.DataFrame.apply() over multiple columns.

Here is a simple code example:

import pandas as pd
import itertools

data = {'col1': [1,3], 'col2': [2,4], 'col3': [5,6]}
df = pd.DataFrame(data)

def add(num1,num2):
    return num1 + num2

cols = list(df)
combList = list(itertools.combinations(cols, 2))

for tup in combList:
    firstCol = tup[0]
    secCol = tup[1]
    df[f'sum_{firstCol}_{secCol}'] = df.apply(lambda x: add(x[firstCol], x[secCol]), axis=1)

Hello! Thanks for your answer. With the first instruction I was able to generate the list of all possible combinations. However, once the correlation function is called, I get the following error: "invalid index to scalar variable" on xy = xy + x[i] * y[i] – Guillem Dec 27 '22 at 12:17
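
The error in the comment above most likely arises because apply(..., axis=1) passes one row at a time, so x[firstCol] is a single number rather than a column, and a number cannot be indexed with x[i]. A possible workaround, sketched here under that assumption, is to skip apply and pass whole columns to the function. The name uncentered_correlation_vec is hypothetical; it is a vectorized rewrite of the question's formula, not the original function.

```python
import itertools

import numpy as np
import pandas as pd

def uncentered_correlation_vec(x, y):
    # Same formula as in the question, computed on whole arrays at once
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

df = pd.DataFrame({"col1": [1, 3], "col2": [2, 4], "col3": [5, 6]})

cols = list(df)
# A nonzero column has uncentered correlation 1 with itself, so start from the identity
result = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
for first_col, sec_col in itertools.combinations(cols, 2):
    value = uncentered_correlation_vec(df[first_col], df[sec_col])
    result.loc[first_col, sec_col] = value   # fill both triangles,
    result.loc[sec_col, first_col] = value   # since the measure is symmetric
```

Using combinations instead of product computes each unordered pair only once, which halves the work for a symmetric matrix.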
