I've been trying for some time now to perform an ANOVA analysis on some of my data. In this case I'm trying to fit several models to see which one is the best for my experimental data. I'm using the OLS approach in statsmodel to fit linear, linear with interactions, and quadratic formulas. The dataset is made up out of a table that has three tested variables (Speed, Pressure, and Spacing) and the measured thickness (which is the variable I'm trying to fit the model for). For each variable there are 3 experimental points. Each experiment was performed 4 times. The data looks more or less like this:
ID Pressure Speed Spacing Thickness
0 20 10 110 24
1 20 25 120 23
2 20 25 100 26
3 20 40 110 27
4 26 10 120 33
5 26 10 100 37
The code I'm using is as follows:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
data = pd.read_csv('ANOVA Data.csv') # Load data from .csv file
# data[['Pressure','Speed','Spacing']] = data[['Pressure','Speed','Spacing']].astype(str) # Tried to convert variables into strings to not confuse statsmodels, doesn't work with quadratics
model_t_lin = smf.ols(formula="Thickness ~C(Pressure) + C(Speed) + C(Spacing)",data = data).fit() # Linear OLS
model_t_int = smf.ols(formula="Thickness ~ C(Pressure) + C(Speed) + C(Spacing) + C(Pressure*Speed) + C(Pressure*Spacing) + C(Spacing*Speed)",data = data).fit() # Interactions OLS
model_t_qd = smf.ols(formula="Thickness ~ C(Pressure) + C(Speed) + C(Spacing) + C(Pressure*Speed) + C(Pressure*Spacing) + C(Spacing*Speed) + C(Pressure**2) + C(Spacing**2) + C(Speed**2)",data = data).fit() # Quadratic OLS
models = [model_t_lin,model_t_int,model_t_qd] # Assemble vector for ease of use
modelnoms = ['t_linear','t_interactions','t_quadratic'] # Model names
A_type = 3 # Set Anova Type
for i in range(0,len(models)):
print(' ')
print(modelnoms[i]+':')
print(models[i].summary())
print(' ')
print(sm.stats.anova_lm(models[i],typ=A_type))
# Determine and normalize coefficients of models
a_t_lin = model_t_lin.params/model_t_lin.params[0]
a_t_int = model_t_int.params/model_t_int.params[0]
a_t_qd = model_t_qd.params/model_t_qd.params[0]
coeff = [a_t_lin,a_t_int,a_t_qd] # Coefficients vector for plotting
fullnoms = ['Thickness - linear','Thickness - interactions','Thickness - quadratic']
mpl.rc('font',family='Verdana')
mpl.rcParams.update({'figure.autolayout': True})
for i in range(0,len(models)):
plt.figure()
plt.title(fullnoms[i])
coeff[i].plot.bar()
plt.show()
Considering that my variables are all numbers I tried to either pass them as strings, which didn't work well with the quadratic terms, or as categories, which gave no errors. The former seems to work best for the ANOVA. However, when I'm trying to determine the coefficients of my model I get two coefficients from the stats.summary, while expecting just one.
The second problem seems that my quadratic terms are ignored in the ANOVA. I guess this makes sense in some way because I'm using the anova_lm, which as I understand only treats linear models. However, then I don't understand why it has no problem with the interactions formula which is also non-linear, unless it's providing a wrong result.
Does anyone have any experience using this package or would there be a better way to perform this analysis?