Cross-Formulation DataFrame Like corr() in Python

Question

I would like to generate a DataFrame similar to corr(), but with a different formulation.

For example, suppose I have a DataFrame

import pandas as pd
pa = pd.DataFrame()

pa['john']=[2,3,4,5,6]
pa['june']=[4,6,7,8,2]
pa['kate']=[3,2,3,4,5]

Pandas has the corr() build-in function that generates a new correlation DataFrame. So if I call pa.corr() is returns me

        john    june    kate 
john    1.000000    -0.131306   0.832050 
june    -0.131306    1.000000   -0.437014 
kate    0.832050    -0.437014   1.000000

I want to generate a new DataFrame like that, but with different formulation, for example,

        john                        june                        kate 
john    formula(john)*formula(john) formula(june)*formula(john) formula(kate)*formula(john)
june    formula(john)*formula(june) formula(june)*formula(june) formula(kate)*formula(june)
kate    formula(john)*formula(kate) formula(june)*formula(kate) formula(kate)*formula(kate)

where formula() calculates over one DataFrame column (could be, eg, formula(pa['john']) How can I do it?

fmarm · Answer 1 · 2019-11-29T00:38:50.080

Here is a way to do it, not sure if it's the most straightforward

# random function
def formula(x,y):
   return sum(x*y)
import numpy as np
# create a list with tuples with all columns crossings
l = [(x,y) for x in pa.columns for y in pa.columns]
#[('john', 'john'),
# ('john', 'june'),
# ('john', 'kate'),
# ('june', 'john'),
# ('june', 'june'),
# ('june', 'kate'),
# ('kate', 'john'),
# ('kate', 'june'),
# ('kate', 'kate')]

# create dataframe with all info
# x = first element in tuple = one of pa column name
# y = second element in tuple = one of pa column name
# values = formula(pa[x],pa[y])
df = pd.DataFrame({'x': [el[0] for el in l], 
                   'y': [el[1] for el in l] ,
                   'values':[formula(pa[x],pa[y]) for x,y in l]} )

#   x   y   values
#0  john    john    90
#1  john    june    106
#2  john    kate    74
#3  june    john    106
#4  june    june    169
#5  june    kate    87
#6  kate    john    74
#7  kate    june    87
#8  kate    kate    63


# pivot df to obtain the format you want
table = pd.pivot_table(df, values='values', index=['x'],columns=['y'], aggfunc=np.sum).reset_index()


# y    x    john    june    kate
#0  john    90      106     74
#1  june    106     169     87
#2  kate    74      87      63

Thank you!. The only problem is that it is sorting the columns. For this example specific it is no deal, but in real it is. How could I block pivot_table from sorting it? — Rego, Nov 29 '19 at 22:02
I don't think you can do it directly with pivot_table but you can reorder the columns aftewards, see this for example https://stackoverflow.com/questions/11067027/re-ordering-columns-in-pandas-dataframe-based-on-column-name — fmarm, Nov 30 '19 at 10:02

Cross-Formulation DataFrame Like corr() in Python

1 Answers1