0

I would like to generate a DataFrame similar to corr(), but with a different formulation.

For example, suppose I have a DataFrame

import pandas as pd
pa = pd.DataFrame()

pa['john']=[2,3,4,5,6]
pa['june']=[4,6,7,8,2]
pa['kate']=[3,2,3,4,5]

Pandas has the corr() build-in function that generates a new correlation DataFrame. So if I call pa.corr() is returns me

        john    june    kate 
john    1.000000    -0.131306   0.832050 
june    -0.131306    1.000000   -0.437014 
kate    0.832050    -0.437014   1.000000

I want to generate a new DataFrame like that, but with different formulation, for example,

        john                        june                        kate 
john    formula(john)*formula(john) formula(june)*formula(john) formula(kate)*formula(john)
june    formula(john)*formula(june) formula(june)*formula(june) formula(kate)*formula(june)
kate    formula(john)*formula(kate) formula(june)*formula(kate) formula(kate)*formula(kate)

where formula() calculates over one DataFrame column (could be, eg, formula(pa['john']) How can I do it?

Rego
  • 1,118
  • 1
  • 18
  • 40

1 Answers1

1

Here is a way to do it, not sure if it's the most straightforward

# random function
def formula(x,y):
   return sum(x*y)
import numpy as np
# create a list with tuples with all columns crossings
l = [(x,y) for x in pa.columns for y in pa.columns]
#[('john', 'john'),
# ('john', 'june'),
# ('john', 'kate'),
# ('june', 'john'),
# ('june', 'june'),
# ('june', 'kate'),
# ('kate', 'john'),
# ('kate', 'june'),
# ('kate', 'kate')]

# create dataframe with all info
# x = first element in tuple = one of pa column name
# y = second element in tuple = one of pa column name
# values = formula(pa[x],pa[y])
df = pd.DataFrame({'x': [el[0] for el in l], 
                   'y': [el[1] for el in l] ,
                   'values':[formula(pa[x],pa[y]) for x,y in l]} )

#   x   y   values
#0  john    john    90
#1  john    june    106
#2  john    kate    74
#3  june    john    106
#4  june    june    169
#5  june    kate    87
#6  kate    john    74
#7  kate    june    87
#8  kate    kate    63


# pivot df to obtain the format you want
table = pd.pivot_table(df, values='values', index=['x'],columns=['y'], aggfunc=np.sum).reset_index()


# y    x    john    june    kate
#0  john    90      106     74
#1  june    106     169     87
#2  kate    74      87      63
fmarm
  • 4,209
  • 1
  • 17
  • 29
  • Thank you!. The only problem is that it is sorting the columns. For this example specific it is no deal, but in real it is. How could I block pivot_table from sorting it? – Rego Nov 29 '19 at 22:02
  • 1
    I don't think you can do it directly with pivot_table but you can reorder the columns aftewards, see this for example https://stackoverflow.com/questions/11067027/re-ordering-columns-in-pandas-dataframe-based-on-column-name – fmarm Nov 30 '19 at 10:02