2
  • I'd like to fit linear regression models of different degrees to a data set and choose the best fitting one based on adjusted r^2.
  • Based on other answers, I'm using the OLS formula "y ~ 1 + " + " + ".join("I(x**{})".format(i) for i in range(1, degree+1)),
  • I don't have enough statistics knowledge to understand: is the 1 + constant needed and, if so, what should the constant value be?
import numpy
import pandas
import matplotlib
import matplotlib.offsetbox
import statsmodels.tools
import statsmodels.formula.api

data = numpy.array([
  [1999, 197.0],
  [2000, 196.5],
  [2001, 194.3],
  [2002, 193.7],
  [2003, 192.0],
  [2004, 189.2],
  [2005, 189.3],
  [2006, 187.6],
  [2007, 186.9],
  [2008, 186.0],
  [2009, 185.0],
  [2010, 186.2],
  [2011, 185.1],
  [2012, 185.6],
  [2013, 185.0],
  [2014, 185.6],
  [2015, 185.4],
  [2016, 185.1],
  [2017, 183.9],
])

df = pandas.DataFrame(data, columns=["Year", "CrudeRate"])

cause = "Malignant neoplasms"
x = df["Year"].values
y = df["CrudeRate"].values
degree = 2
predict_future_years = 5

# https://stackoverflow.com/a/34617603/4135310
olsdata = {"x": x, "y": y}
formula = "y ~ 1 + " + " + ".join("I(x**{})".format(i) for i in range(1, degree+1))
model = statsmodels.formula.api.ols(formula, olsdata).fit()

print(model.summary())

ax = df.plot("Year", "CrudeRate", kind="scatter", grid=True, title="Deaths from {}".format(cause))

# https://stackoverflow.com/a/37294651/4135310
func = numpy.poly1d(model.params.values[::-1])
matplotlib.pyplot.plot(df["Year"], func(df["Year"]))

predicted = func(df.Year.values[-1] + predict_future_years)
print("Predicted in {} years: {}".format(predict_future_years, predicted))

ax.add_artist(matplotlib.offsetbox.AnchoredText("$\\barR^2$ = {:0.2f}".format(model.rsquared_adj), loc="upper center"))
ax.add_artist(matplotlib.offsetbox.AnchoredText("Predicted in +{} = {:0.2f}".format(predict_future_years, predicted), loc="upper right"))

ax.xaxis.set_major_formatter(matplotlib.ticker.FormatStrFormatter("%d"))
fig = matplotlib.pyplot.gcf()
fig.autofmt_xdate(bottom=0.2, rotation=30, ha="right", which="both")
matplotlib.pyplot.tight_layout()
cleaned_title = cause.replace(" ", "_").replace("(", "").replace(")", "")
#matplotlib.pyplot.savefig("{}_{}.png".format(cleaned_title, degree), dpi=100)
matplotlib.pyplot.show()

Figure

  • Using the `Patsy` notation, you don't need to include `+1`, as the constant is by default added to your model. If you need to explicitly remove the constant, you do so with `-1`. That being said specifying `y ~ 1 + ...` does **NOT** specify that your model has a constant that takes on the value of 1. It instead specifies that you have a Constant, whose magnitude will be given by the `Interecept` coeff. This is the constant determined by OLS, so it's the one you want. – ALollz Feb 07 '19 at 21:56
  • If instead you do truly need to fit the formula `y = 5 + Beta*x`, this can be very tricky. If you don't need much of other than the coefficients and the covariance matrix, then [`scipy.optimize.curve_fit`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) is easy to implement. – ALollz Feb 07 '19 at 22:05
  • @ALollz Thanks! – freeradical Feb 07 '19 at 22:37
  • just to the fixed constant case: `y = 5 + Beta*x` can be estimated by `y - 5 = Beta*x` in OLS (and GLM has an `offset` options for the same purpose in the nonlinear case) – Josef Feb 08 '19 at 02:18
  • @ALollz -- Your comments would make a good answer. – Jasper Feb 13 '19 at 20:11

1 Answers1

1

Based on comments from @ALollz, when using Patsy notation (e.g. statsmodels.formula.api.ols("y ~ x")), you don't need to include 1 +, as the constant is added by default to the model, although this does not specify that your model has a constant that takes on the value of 1. Instead, it specifies that you have a constant whose magnitude will be given by the intercept coefficient. This is the constant determined by OLS, so it's the one you want.