When curve fitting with Python's statsmodels' OLS linear regression, how do I choose the constant in the formula?

Question

I'd like to fit linear regression models of different degrees to a data set and choose the best fitting one based on adjusted r^2.
Based on other answers, I'm using the OLS formula "y ~ 1 + " + " + ".join("I(x**{})".format(i) for i in range(1, degree+1)),
I don't have enough statistics knowledge to understand: is the 1 + constant needed and, if so, what should the constant value be?

import numpy
import pandas
import matplotlib
import matplotlib.offsetbox
import statsmodels.tools
import statsmodels.formula.api

data = numpy.array([
  [1999, 197.0],
  [2000, 196.5],
  [2001, 194.3],
  [2002, 193.7],
  [2003, 192.0],
  [2004, 189.2],
  [2005, 189.3],
  [2006, 187.6],
  [2007, 186.9],
  [2008, 186.0],
  [2009, 185.0],
  [2010, 186.2],
  [2011, 185.1],
  [2012, 185.6],
  [2013, 185.0],
  [2014, 185.6],
  [2015, 185.4],
  [2016, 185.1],
  [2017, 183.9],
])

df = pandas.DataFrame(data, columns=["Year", "CrudeRate"])

cause = "Malignant neoplasms"
x = df["Year"].values
y = df["CrudeRate"].values
degree = 2
predict_future_years = 5

# https://stackoverflow.com/a/34617603/4135310
olsdata = {"x": x, "y": y}
formula = "y ~ 1 + " + " + ".join("I(x**{})".format(i) for i in range(1, degree+1))
model = statsmodels.formula.api.ols(formula, olsdata).fit()

print(model.summary())

ax = df.plot("Year", "CrudeRate", kind="scatter", grid=True, title="Deaths from {}".format(cause))

# https://stackoverflow.com/a/37294651/4135310
func = numpy.poly1d(model.params.values[::-1])
matplotlib.pyplot.plot(df["Year"], func(df["Year"]))

predicted = func(df.Year.values[-1] + predict_future_years)
print("Predicted in {} years: {}".format(predict_future_years, predicted))

ax.add_artist(matplotlib.offsetbox.AnchoredText("$\\barR^2$ = {:0.2f}".format(model.rsquared_adj), loc="upper center"))
ax.add_artist(matplotlib.offsetbox.AnchoredText("Predicted in +{} = {:0.2f}".format(predict_future_years, predicted), loc="upper right"))

ax.xaxis.set_major_formatter(matplotlib.ticker.FormatStrFormatter("%d"))
fig = matplotlib.pyplot.gcf()
fig.autofmt_xdate(bottom=0.2, rotation=30, ha="right", which="both")
matplotlib.pyplot.tight_layout()
cleaned_title = cause.replace(" ", "_").replace("(", "").replace(")", "")
#matplotlib.pyplot.savefig("{}_{}.png".format(cleaned_title, degree), dpi=100)
matplotlib.pyplot.show()

Using the `Patsy` notation, you don't need to include `+1`, as the constant is by default added to your model. If you need to explicitly remove the constant, you do so with `-1`. That being said specifying `y ~ 1 + ...` does **NOT** specify that your model has a constant that takes on the value of 1. It instead specifies that you have a Constant, whose magnitude will be given by the `Interecept` coeff. This is the constant determined by OLS, so it's the one you want. — ALollz, Feb 07 '19 at 21:56
If instead you do truly need to fit the formula `y = 5 + Beta*x`, this can be very tricky. If you don't need much of other than the coefficients and the covariance matrix, then [`scipy.optimize.curve_fit`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) is easy to implement. — ALollz, Feb 07 '19 at 22:05
just to the fixed constant case: `y = 5 + Beta*x` can be estimated by `y - 5 = Beta*x` in OLS (and GLM has an `offset` options for the same purpose in the nonlinear case) — Josef, Feb 08 '19 at 02:18

score 1 · Accepted Answer · answered Feb 17 '19 at 15:42

Based on comments from @ALollz, when using Patsy notation (e.g. statsmodels.formula.api.ols("y ~ x")), you don't need to include 1 +, as the constant is added by default to the model, although this does not specify that your model has a constant that takes on the value of 1. Instead, it specifies that you have a constant whose magnitude will be given by the intercept coefficient. This is the constant determined by OLS, so it's the one you want.

When curve fitting with Python's statsmodels' OLS linear regression, how do I choose the constant in the formula?

1 Answers1