I am not comfortable with Python; I am much less intimidated by R and more at ease with it. So indulge me on a silly question that has taken me a ton of searches without success.
I want to fit a regression model with sklearn, using both OLS and lasso. In particular, I like the mtcars dataset, which is so easy to call in R and, as it turns out, also very accessible in Python:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
It looks like this:
                 mpg  cyl   disp   hp  drat  ...   qsec  vs  am  gear  carb
Mazda RX4       21.0    6  160.0  110  3.90  ...  16.46   0   1     4     4
Mazda RX4 Wag   21.0    6  160.0  110  3.90  ...  17.02   0   1     4     4
Datsun 710      22.8    4  108.0   93  3.85  ...  18.61   1   1     4     1
Hornet 4 Drive  21.4    6  258.0  110  3.08  ...  19.44   1   0     3     1
In trying to use LinearRegression(), the usual structure found is
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x, y)
but to do so, I need to select several columns of df to serve as the regressors x, and one column to be the dependent variable y. For example, I'd like to get an x matrix that includes a column of 1's (for the intercept), the numerical variables disp and qsec, and the categorical variable cyl. As the dependent variable, I'd like to use mpg.
If it were possible to word it this way, it would look like
model = LinearRegression().fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
But how do I go about the syntax for it?
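For reference, here is a minimal sketch of one way the formula idea might be expressed. It assumes the patsy package (installed alongside statsmodels) is used to build the design matrix, and it swaps in a dozen mtcars rows typed by hand so the snippet runs without a download. Because patsy already adds the column of 1's, fit_intercept=False is passed to sklearn; that is a choice of this sketch, not a requirement.

```python
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LinearRegression

# A few mtcars rows typed in by hand (assumption: stands in for the full dataset).
df = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 16.4, 17.3],
        "cyl": [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 8, 8],
        "disp": [160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 275.8, 275.8],
        "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 17.40, 17.60],
    },
    index=[
        "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive",
        "Hornet Sportabout", "Valiant", "Duster 360", "Merc 240D",
        "Merc 230", "Merc 280", "Merc 450SE", "Merc 450SL",
    ],
)

# R-style formula: C(cyl) is dummy-coded, and an "Intercept" column of 1's is added.
y, X = dmatrices("mpg ~ disp + qsec + C(cyl)", data=df, return_type="dataframe")

# The design matrix already holds the intercept, so sklearn should not add another.
model = LinearRegression(fit_intercept=False).fit(X, y["mpg"])

# Pair each coefficient with its column name so the output stays readable.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)
```

The resulting coefs series is indexed by the design-matrix column names, so each coefficient can be read next to the variable it belongs to.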
Similarly, how can I do the same with lasso:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
but again this is not the right syntax.
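A comparable sketch for lasso, under the same assumptions (patsy builds the design matrix, and a dozen hand-typed mtcars rows stand in for the full data). Here the Intercept column is dropped and Lasso is left to fit its own intercept, so that the constant term is not penalized. The max_iter value is an arbitrary choice for this sketch; the features are also left unscaled, which may matter for a penalized fit.

```python
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import Lasso

# A few mtcars rows typed in by hand (assumption: stands in for the full dataset).
df = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 16.4, 17.3],
        "cyl": [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 8, 8],
        "disp": [160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 275.8, 275.8],
        "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 17.40, 17.60],
    }
)

# Same R-style formula; C(cyl) is expanded into dummy columns.
y, X = dmatrices("mpg ~ disp + qsec + C(cyl)", data=df, return_type="dataframe")

# Drop the column of 1's and let Lasso estimate an unpenalized intercept itself.
X = X.drop(columns="Intercept")

lasso = Lasso(alpha=0.001, max_iter=100_000).fit(X, y["mpg"])

# Keep the column names attached to the coefficients.
coefs = pd.Series(lasso.coef_, index=X.columns)
print("intercept:", lasso.intercept_)
print(coefs)
```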
I did find that you can get the actual regression (OLS or lasso) by turning the dataframe into a matrix. However, the column names are then gone, and it is hard to tell which variable corresponds to each coefficient. And I still haven't found a simple way to get diagnostic values, like p-values, or even the R-squared to begin with.
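As far as I can tell, sklearn's linear models do not report p-values, so one possible route for the diagnostics is the statsmodels formula interface (already imported above as smf). The sketch below assumes that route is acceptable for the OLS case and again uses a dozen hand-typed mtcars rows so it runs without a download.

```python
import pandas as pd
import statsmodels.formula.api as smf

# A few mtcars rows typed in by hand (assumption: stands in for the full dataset).
df = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 16.4, 17.3],
        "cyl": [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 8, 8],
        "disp": [160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 275.8, 275.8],
        "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 17.40, 17.60],
    }
)

# OLS with an R-style formula; the intercept is included by default.
res = smf.ols("mpg ~ disp + qsec + C(cyl)", data=df).fit()

# Coefficients and p-values are pandas Series indexed by term name.
print(res.params)
print(res.pvalues)
print("R-squared:", res.rsquared)
```

Calling res.summary() prints these values together in a single table. For the sklearn models themselves, the score method of a fitted estimator returns the R-squared on the data passed to it.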