0

I'm trying to join a dataframe of results from a statsmodels GLM to a dataframe designed to hold both univariate data and model results as models are iterated through. i'm having trouble figuring out how to grammatically join the two data sets.

I've consulted the pandas documentation found below to no luck:

https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging

This is difficult because of the out put of the model compared to the final table which holds values of each unique level of each unique variable.

See an example of what the data looks like with the code below:

import pandas as pd

df = {'variable': ['CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
                  ,'channel_model','channel_model','channel_model']
      , 'level': [0,100,200,250,500,750,1000, 'DIR', 'EA', 'IA']
      ,'value': [460955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36
               ,36388470.44,31805316.37]}

final_table = pd.DataFrame(df)


df2 = {'variable': ['intercept','C(channel_model)[T.EA]','C(channel_model)[T.IA]', 'CLded_model']
       , 'coefficient': [-2.36E-14,-0.091195797,-0.244225888, 0.00174356]}

model_results = pd.DataFrame(df2)

After this is run you can see that for categorical variables, the value is incased in a few layers compared to the final_table. Numerical values such as CLded_model needs to be joined with the one coefficient it's associated with.

There is a lot to this and i'm not sure where to start.

Update: The following code produces the desired result:

d3 = {'variable': ['intercept', 'CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
                   ,'CLded_model','channel_model','channel_model','channel_model']
      , 'level': [None, 0,100,200,250,500,750,1000, 'DIR', 'EA', 'IA']
      ,'value': [None, 60955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36
               ,36388470.44,31805316.37]
      , 'coefficient': [ -2.36E-14, 0.00174356,  0.00174356,  0.00174356,  0.00174356,  0.00174356 ,0.00174356
                        , 0.00174356,None, -0.091195797,-0.244225888, ]}

desired_result = pd.DataFrame(d3)
Jordan
  • 1,415
  • 3
  • 18
  • 44

1 Answers1

1

First you have to clean df2:

df2['variable'] = df2['variable'].str.replace("C\(","")\
                                 .str.replace("\)\[T.", "-")\
                                 .str.strip("\]")

df2


       variable          coefficient
0   intercept           -2.360000e-14
1   channel_model-EA    -9.119580e-02
2   channel_model-IA    -2.442259e-01
3   CLded_model          1.743560e-03

Because you want to merge some of df1 on the level column and others not, we need to change df1 slightly to match df2:

df1.loc[df1['variable'] == 'channel_model', 'variable'] = "channel_model-"+df1.loc[df1['variable'] == 'channel_model', 'level']

df1

#snippet of what changed
      variable         level     value
6   CLded_model        1000   1.458799e+07
7   channel_model-DIR   DIR   1.049374e+07
8   channel_model-EA    EA    3.638847e+07
9   channel_model-IA    IA    3.180532e+07

Then we merge them:

df4 = df1.merge(df2, how = 'outer', left_on =['variable'], right_on = ['variable'])

And we get your result (except for the minor change in the variable name)

Ben Pap
  • 2,549
  • 1
  • 8
  • 17
  • Thank you. Is there a way to do this and keep it in the final-table format? I have a function that takes the final table, loops through the variables and plots the level on the x-axis and the value I choose on the y-axis. – Jordan Apr 20 '19 at 13:26
  • 1
    As in remove the changes to the variable column that I made? ```df4['variable']=df4['variable'].str.replace("-.*", '', regex = True)``` Will remove the changes to that column, and it will look identical to your df3. – Ben Pap Apr 20 '19 at 17:58