1

I've been trying for some time now to perform an ANOVA analysis on some of my data. In this case I'm trying to fit several models to see which one is the best for my experimental data. I'm using the OLS approach in statsmodel to fit linear, linear with interactions, and quadratic formulas. The dataset is made up out of a table that has three tested variables (Speed, Pressure, and Spacing) and the measured thickness (which is the variable I'm trying to fit the model for). For each variable there are 3 experimental points. Each experiment was performed 4 times. The data looks more or less like this:

ID       Pressure Speed Spacing  Thickness
0        20       10     110     24
1        20       25     120     23
2        20       25     100     26
3        20       40     110     27
4        26       10     120     33
5        26       10     100     37

The code I'm using is as follows:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

data = pd.read_csv('ANOVA Data.csv') # Load data from .csv file
# data[['Pressure','Speed','Spacing']] = data[['Pressure','Speed','Spacing']].astype(str) # Tried to convert variables into strings to not confuse statsmodels, doesn't work with quadratics

model_t_lin = smf.ols(formula="Thickness ~C(Pressure) + C(Speed) + C(Spacing)",data = data).fit() # Linear OLS
model_t_int = smf.ols(formula="Thickness ~ C(Pressure) + C(Speed) + C(Spacing) + C(Pressure*Speed) + C(Pressure*Spacing) + C(Spacing*Speed)",data = data).fit() # Interactions OLS
model_t_qd = smf.ols(formula="Thickness ~ C(Pressure) + C(Speed) + C(Spacing) + C(Pressure*Speed) + C(Pressure*Spacing) + C(Spacing*Speed) + C(Pressure**2) + C(Spacing**2) + C(Speed**2)",data = data).fit() # Quadratic OLS

models = [model_t_lin,model_t_int,model_t_qd] # Assemble vector for ease of use

modelnoms = ['t_linear','t_interactions','t_quadratic'] # Model names

A_type = 3 # Set Anova Type

for i in range(0,len(models)):
    print(' ')
    print(modelnoms[i]+':')
    print(models[i].summary())
    print(' ')
    print(sm.stats.anova_lm(models[i],typ=A_type))
    
# Determine and normalize coefficients of models
a_t_lin = model_t_lin.params/model_t_lin.params[0]
a_t_int = model_t_int.params/model_t_int.params[0]
a_t_qd = model_t_qd.params/model_t_qd.params[0]

coeff = [a_t_lin,a_t_int,a_t_qd] # Coefficients vector for plotting

fullnoms = ['Thickness - linear','Thickness - interactions','Thickness - quadratic']

mpl.rc('font',family='Verdana')
mpl.rcParams.update({'figure.autolayout': True})

for i in range(0,len(models)):
    plt.figure()
    plt.title(fullnoms[i])
    coeff[i].plot.bar()
    plt.show()

Considering that my variables are all numbers I tried to either pass them as strings, which didn't work well with the quadratic terms, or as categories, which gave no errors. The former seems to work best for the ANOVA. However, when I'm trying to determine the coefficients of my model I get two coefficients from the stats.summary, while expecting just one.

The second problem seems that my quadratic terms are ignored in the ANOVA. I guess this makes sense in some way because I'm using the anova_lm, which as I understand only treats linear models. However, then I don't understand why it has no problem with the interactions formula which is also non-linear, unless it's providing a wrong result.

Does anyone have any experience using this package or would there be a better way to perform this analysis?

Lythes
  • 11
  • 2
  • iI you treat the explanatory variable as categorical, than taking square doesn't change it, C(Pressure) and C(Pressure**2) should both create the same dummy coding. OLS `summary()` should show a warning in this case. – Josef Sep 22 '21 at 12:16

1 Answers1

0

Taking inspiration from this question, you could try to wrap your quadratic terms into I(...), like this:

 model_t_qd = smf.ols(
    formula="Thickness ~ Pressure + Speed + Spacing + Pressure*Speed + Pressure*Spacing + Spacing*Speed + I(Pressure*Pressure) + I(Spacing*Spacing) + I(Speed*Speed)",
    data=data).fit() # Quadratic OLS

I also wrote your other factors as continuous, not categorial, this seems to make more sense to me.

I just tried it with an example dataset and for me it worked. I cross-checked the results with a commercial software and they were 100% identical.

Does this make sense when using anova_lm? Well, unfortunately, I do not know enough detail about how anova_lm works in the background to say something smart about that.

Snijderfrey
  • 245
  • 3
  • 12