How to Perform a Complicated Join in Pandas with Interaction Terms from Statsmodel output

Question

This is an extension of this quetion: To join complicated pandas tables

I have three different interactions in a statsmodels GLM. I need a final table that pairs coeficients with other univariate analysis results.

Below is an example of what the tables look like with a marital status and age interaction in the model. The final_table is the table that has the univariate results in. I want to join coefficient values (among other statistics, p_values, standard_error etc) from the model results to that final table (this is model_results in the code below).

df = {'variable': ['CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
                   ,'married_age','married_age','married_age', 'class_cc', 'class_cc', 'class_cc', 'class_cc', 'class_v_age'
                  ,'class_v_age','class_v_age', 'class_v_age'],
      'level': [0,100,200,250,500,750,1000, 'M_60', 'M_61', 'S_62', 'Harley_100', 'Harley_1200', 'Sport_1500', 'other_100'
                ,'Street_10', 'other_20', 'Harley_15', 'Sport_10'],
      'value': [460955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36,36388470.44
                ,31805316.37, 123.4, 4546.50, 439854.23, 2134.4, 2304.5, 2032.30, 159.80, 22]}


final_table1 = pd.DataFrame(df)
final_table1

Join the above with it's different ways statsmodels communicates the results to:

df2 = {'variable': ['intercept','driver_age_model:C(marital_status_model)[M]', 'driver_age_model:C(marital_status_model)[S]'
                    , 'CLded_model','C(class_model)[Harley]:v_age_model', 'C(class_model)[Sport]:v_age_model'
                    ,'C(class_model)[Street]:v_age_model', 'C(class_model)[other]:v_age_model'
                    , 'C(class_model)[Harley]:cc_model', 'C(class_model)[Sport]:cc_model' , 'C(class_model)[Street]:cc_model'
                    , 'C(class_model)[other]:cc_model']
       ,'coefficient': [-2.36E-14,-1.004648e-02,-1.071730e-02, 0.00174356,-0.07222433,-0.146594998,-0.168168491,-0.084420399
                        ,-0.000181233,0.000872798,0.001229771,0.001402564]}
model_results = pd.DataFrame(df2)
model_results

With the desired final result:

df3 = {'variable': ['intercept', 'CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
                   ,'married_age','married_age','married_age', 'class_cc', 'class_cc', 'class_cc', 'class_cc', 'class_v_age'
                  ,'class_v_age','class_v_age', 'class_v_age'],
      'level': [None,0,100,200,250,500,750,1000, 'M_60', 'M_61', 'S_62', 'Harley_100', 'Harley_1200', 'Sport_1500', 'other_100'
                ,'Street_10', 'other_20', 'Harley_15', 'Sport_10'],
      'value': [None, 460955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36,36388470.44
                ,31805316.37, 123.4, 4546.50, 439854.23, 2134.4, 2304.5, 2032.30, 159.80, 22],
       'coefficient': [-2.36E-14, 0.00174356,  0.00174356,  0.00174356,  0.00174356,  0.00174356 ,0.00174356 , 0.00174356
                       ,-1.004648e-02, -1.004648e-02,-1.071730e-02,-1.812330e-04,-1.812330e-04,8.727980e-04,1.402564e-03
                      ,-1.681685e-01, -8.442040e-02, -1.812330e-04, -1.465950e-01]}

results = pd.DataFrame(df3)
results

When I implement the first answer, it effected this answer.

score 1 · Accepted Answer · answered Apr 24 '19 at 18:06

df = {'variable': ['CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','married_age','married_age','married_age'],
      'level': [0,100,200,250,500,750,1000, 'M_60', 'M_61', 'S_62'],
      'value': [460955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36,36388470.44,31805316.37]}


df2 = {'variable': ['intercept','driver_age_model:C(marital_status_model)[M]', 'driver_age_model:C(marital_status_model)[S]', 'CLded_model'],
       'coefficient': [-2.36E-14,-1.004648e-02,-1.071730e-02, 0.00174356]}

df3 = {'variable': ['intercept', 'CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','married_age','married_age','married_age'],
       'level': [None, 0,100,200,250,500,750,1000, 'M_60', 'M_61', 'S_62'],
       'value': [None, 60955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36, 36388470.44,31805316.37],
       'coefficient': [-2.36E-14, 0.00174356,  0.00174356,  0.00174356,  0.00174356,  0.00174356 ,0.00174356 , 0.00174356,-1.004648e-02, -1.004648e-02,-1.071730e-02]}

final_table = pd.DataFrame(df)
model_results = pd.DataFrame(df2)
results = pd.DataFrame(df3)

# Change slightly df to match what we're going to merge
final_table.loc[final_table['variable'] == 'married_age', 'variable'] = 'married_age-'+final_table.loc[final_table['variable'] == 'married_age', 'level'].str[0]

# Clean df2 and get it ready for merge
model_results['variable'] = model_results['variable'].str.replace('driver_age_model:C\(marital_status_model\)\[', 'married_age-')\
                                                     .str.strip('\]')
# Merge
df4 = final_table.merge(model_results, how = 'outer', left_on = 'variable', right_on = 'variable')

#Clean
df4['variable'] = df4['variable'].str.replace('-.*', '', regex = True)

Pretty much the same thing as last time, the only difference was how you clean df2.

Thank you @Ben Pap. What if you had to clearn `C(class_model)`[Harley]:cc_model` to match `class_cc' with a level of `Harley_100' and 'Harley_200' and so on? Would it be `coeffs['index'] = coeffs['index'].str.replace('C\(class_model\)\[', 'class_cc-').str.strip('\]')` — Jordan, Apr 25 '19 at 09:17
Acutally, I see what's happening. The first answer is effecting the second one. I'm going to edit this post to merge both. — Jordan, Apr 25 '19 at 09:23
Figured out how to do the model result table. `coeffs['index'] = coeffs['index'].str.replace('v_age_model:C\(class_model\)\[', 'class_v_age-')\ .str.strip('\]') coeffs['index'] = coeffs['index'].str.replace('cc_model:C\(class_model\)\[', 'class_cc-')\ .str.strip('\]') ` Now to figure out how to get the final table to bring over the correct amount of letters depending on if it's harley, sport, street, etc. — Jordan, Apr 25 '19 at 11:12
GOT IT! This is awesome. `final_table.loc[final_table['variable'] == 'class_cc', 'variable'] = 'class_cc-'+final_table.loc[final_table['variable'] =='class_cc' ,'unique_value'].str[0:6].str.strip("\_")` — Jordan, Apr 25 '19 at 11:31

How to Perform a Complicated Join in Pandas with Interaction Terms from Statsmodel output

1 Answers1