To join complicated pandas tables

Question

I'm trying to join a dataframe of results from a statsmodels GLM to a dataframe designed to hold both univariate data and model results as models are iterated through. i'm having trouble figuring out how to grammatically join the two data sets.

I've consulted the pandas documentation found below to no luck:

https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging

This is difficult because of the out put of the model compared to the final table which holds values of each unique level of each unique variable.

See an example of what the data looks like with the code below:

import pandas as pd

df = {'variable': ['CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
                  ,'channel_model','channel_model','channel_model']
      , 'level': [0,100,200,250,500,750,1000, 'DIR', 'EA', 'IA']
      ,'value': [460955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36
               ,36388470.44,31805316.37]}

final_table = pd.DataFrame(df)


df2 = {'variable': ['intercept','C(channel_model)[T.EA]','C(channel_model)[T.IA]', 'CLded_model']
       , 'coefficient': [-2.36E-14,-0.091195797,-0.244225888, 0.00174356]}

model_results = pd.DataFrame(df2)

After this is run you can see that for categorical variables, the value is incased in a few layers compared to the final_table. Numerical values such as CLded_model needs to be joined with the one coefficient it's associated with.

There is a lot to this and i'm not sure where to start.

Update: The following code produces the desired result:

d3 = {'variable': ['intercept', 'CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
                   ,'CLded_model','channel_model','channel_model','channel_model']
      , 'level': [None, 0,100,200,250,500,750,1000, 'DIR', 'EA', 'IA']
      ,'value': [None, 60955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36
               ,36388470.44,31805316.37]
      , 'coefficient': [ -2.36E-14, 0.00174356,  0.00174356,  0.00174356,  0.00174356,  0.00174356 ,0.00174356
                        , 0.00174356,None, -0.091195797,-0.244225888, ]}

desired_result = pd.DataFrame(d3)

For this example, can you share what your desired output would look like? — Silenced Temporarily, Apr 19 '19 at 20:31
I agree, desired output is crucial here to understand what you want to achieve — Erfan, Apr 19 '19 at 20:32
Just got in my car. I will add a desired result when I get home. Thanks for pointing that out. — Jordan, Apr 19 '19 at 20:33
Perhaps https://stackoverflow.com/q/53645882/pandas-merging-101 would help? — cs95, Apr 19 '19 at 20:42
what is your desired output? draw on a paper and share the picture to us — Rebin, Apr 19 '19 at 21:07

score 1 · Accepted Answer · answered Apr 20 '19 at 05:59

First you have to clean df2:

df2['variable'] = df2['variable'].str.replace("C\(","")\
                                 .str.replace("\)\[T.", "-")\
                                 .str.strip("\]")

df2


       variable          coefficient
0   intercept           -2.360000e-14
1   channel_model-EA    -9.119580e-02
2   channel_model-IA    -2.442259e-01
3   CLded_model          1.743560e-03

Because you want to merge some of df1 on the level column and others not, we need to change df1 slightly to match df2:

df1.loc[df1['variable'] == 'channel_model', 'variable'] = "channel_model-"+df1.loc[df1['variable'] == 'channel_model', 'level']

df1

#snippet of what changed
      variable         level     value
6   CLded_model        1000   1.458799e+07
7   channel_model-DIR   DIR   1.049374e+07
8   channel_model-EA    EA    3.638847e+07
9   channel_model-IA    IA    3.180532e+07

Then we merge them:

df4 = df1.merge(df2, how = 'outer', left_on =['variable'], right_on = ['variable'])

And we get your result (except for the minor change in the variable name)

Thank you. Is there a way to do this and keep it in the final-table format? I have a function that takes the final table, loops through the variables and plots the level on the x-axis and the value I choose on the y-axis. — Jordan, Apr 20 '19 at 13:26
As in remove the changes to the variable column that I made? ```df4['variable']=df4['variable'].str.replace("-.*", '', regex = True)``` Will remove the changes to that column, and it will look identical to your df3. — Ben Pap, Apr 20 '19 at 17:58

To join complicated pandas tables

1 Answers1

Linked