133

I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], 
                   "B": [20, 30, 10, 40, 50], 
                   "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like `ols(A ~ B + C, data = df)`, but when I look at the examples from algorithm libraries like scikit-learn, they appear to feed the data to the model as a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most Pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?

Michael

6 Answers

186

I think you can do almost exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas 0.20.0 (it was used for a few things in pandas.stats).

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================

Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
DSM
  • Note that the correct keyword is `formula`; I accidentally typed `formulas` instead and got a weird error: `TypeError: from_formula() takes at least 3 arguments (2 given)` – denfromufa Nov 14 '16 at 18:19
  • @DSM Very new to Python. Tried running your same code and got errors on both print statements: `print result.summary()` gives `SyntaxError: invalid syntax` and `print result.parmas` gives `SyntaxError: Missing parentheses in call to 'print'`... Maybe I loaded packages wrong?? It appears to work when I don't put "print". Thanks. – a.powell Apr 01 '17 at 00:45
  • @a.powell The OP's code is for Python 2. The only change I think you need to make is to put parentheses round the arguments to print: `print(result.params)` and `print(result.summary())` – Paul Moore Apr 07 '17 at 14:35
  • Attempting to use this `formula()` approach throws `TypeError: __init__() missing 1 required positional argument: 'endog'`, so I guess it's deprecated. Also, `ols` is now `OLS` – 3pitt May 18 '18 at 18:50
  • As others mention, sm.ols has been deprecated in favor of sm.OLS. The default behavior is also different. To run a regression from a formula as done here, you need to do: `result = sm.OLS.from_formula(formula="A ~ B + C", data=df).fit()` (see the sketch below) – Lucas H Feb 25 '19 at 18:37
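As the comments above note, the statsmodels entry points have moved around between versions; here is a minimal sketch of two spellings that work in recent releases (an illustration, not part of the original answer):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Formula interface: lowercase ols lives in statsmodels.formula.api
result = smf.ols("A ~ B + C", data=df).fit()

# Equivalent call through the class-based API
result2 = sm.OLS.from_formula("A ~ B + C", data=df).fit()

print(result.params)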
77

Note: pandas.stats was removed in pandas 0.20.0


It's possible to do this with pandas.stats.ols:

>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   3

R-squared:         0.5789
Adj R-squared:     0.1577

Rmse:             14.5108

F-stat (2, 2):     1.3746, p-value:     0.4211

Degrees of Freedom: model 2, resid 2

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
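Since pandas.stats.ols is gone, a rough statsmodels equivalent of the call above might look like the sketch below (an illustration; sm.add_constant supplies the intercept that pandas.stats.ols added automatically):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# pandas.stats.ols added an intercept automatically; add_constant does that here
X = sm.add_constant(df[['B', 'C']])
res = sm.OLS(df['A'], X).fit()
print(res.summary())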

Roman Pekar
  • Note that this is going to be deprecated in a future version of pandas! – denfromufa Apr 04 '16 at 17:48
  • `The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://www.statsmodels.org/stable/regression.html` – WestCoastProjects Jan 25 '17 at 00:47
  • @DestaHaileselassieHagos This may be due to an issue with `missing intercepts`. The designer of the equivalent `R` package adjusts by removing the adjustment for the mean: https://stats.stackexchange.com/a/36068/64552. Other suggestions: `you can use sm.add_constant to add an intercept to the exog array` and use a dict: `reg = ols("y ~ x", data=dict(y=y,x=x)).fit()` – WestCoastProjects Jul 04 '17 at 22:38
  • There is a strange data item for C, 42523. It is an outlier. It should probably be removed or imputed with the average excluding 42523. – Golden Lion Feb 12 '21 at 17:50
35

I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a NumPy array or any other data type.

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
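Once fitted this way, the estimator can be queried and used for prediction with the same DataFrame columns; a small self-contained sketch (using the same toy data as the question, not part of the original answer):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

reg = LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

print(reg.intercept_)                # fitted intercept
print(reg.coef_)                     # coefficients for B and C
print(reg.predict(df[['B', 'C']]))   # in-sample predictions of A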
3novak
  • Small diversion from the OP - but I found this particular answer very helpful, after appending `.values.reshape(-1, 1)` to the dataframe columns. For example: `x_data = df['x_data'].values.reshape(-1, 1)` and passing the `x_data` (and a similarly created `y_data`) np arrays into the `.fit()` method (see the sketch below). – S3DEV Oct 24 '17 at 16:27
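A minimal sketch of the single-column case that comment describes (scikit-learn expects a 2-D X, so one column needs reshaping or double-bracket selection; this is an illustration, not part of the answer):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50]})

# Either reshape the 1-D values into an (n, 1) array...
X = df['B'].values.reshape(-1, 1)
# ...or select with double brackets to keep a 2-D DataFrame:
# X = df[['B']]

reg = LinearRegression().fit(X, df['A'])
print(reg.coef_, reg.intercept_)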
18

This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.

No it doesn't; just convert to a NumPy array:

>>> import numpy as np
>>> data = np.asarray(df)

This takes constant time because it just creates a view on your data. Then feed it to scikit-learn:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([  4.01182386e-01,   3.51587361e-04])
>>> lr.intercept_
14.952479503953672
Fred Foo
  • I had to do `np.matrix( np.asarray( df ) )`, because sklearn expected a vertical vector, whereas numpy arrays, once you slice them off an array, act like horizontal vectors, which is great most of the time. – cjohnson318 Jan 08 '14 at 20:05
  • No simple way to do tests of the coefficients with this route, however. – MichaelChirico Nov 26 '14 at 02:29
  • Isn't there a way to directly feed Scikit-Learn with a Pandas DataFrame? – Femto Trader Apr 03 '15 at 15:15
  • for other sklearn modules (decision tree, etc), I've used df['colname'].values, but that didn't work for this. – szeitlin Apr 29 '15 at 23:50
  • You could also use the `.values` attribute. I.e., `reg.fit(df[['B', 'C']].values, df['A'].values)`. – 3novak Jan 07 '17 at 02:47
18

statsmodels can build an OLS model with column references directly to a pandas DataFrame.

Short and sweet:

model = sm.OLS(df[y], df[x]).fit()


Code details and regression summary:

# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np

# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))

# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var not in y ]

# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()

# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()

model.summary()

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================

How to directly get R-squared, coefficients and p-values:

# commands:
model.params
model.pvalues
model.rsquared

# demo:
In[1]: 
model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64

In[2]: 
model.pvalues
Out[2]: 
const    0.000003
B        0.237924
C        0.497400
dtype: float64

In[3]:
model.rsquared
Out[3]:
0.0190
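To predict on new data with this non-formula API, the new rows need the same constant column the model was fitted with; a minimal sketch (`new_df` is a hypothetical frame of new B and C values, not part of the original answer):

import pandas as pd
import statsmodels.api as sm

# assumes the `model` fitted above with sm.add_constant(df[x])
new_df = pd.DataFrame({"B": [10, 50], "C": [20, 80]})
predictions = model.predict(sm.add_constant(new_df))
print(predictions)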
vestland
0

B is not statistically significant, and with only five observations the data cannot support firm inferences. A change in C does, however, influence the predictions, as the plots below show.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Replace the outlier in C with the mean of the below-average C values
avg_c = df['C'].mean()
sumC = df['C'].apply(lambda x: x if x < avg_c else 0).sum()
countC = df['C'].apply(lambda x: 1 if x < avg_c else None).count()
avg_c2 = sumC / countC
df['C'] = df['C'].apply(lambda x: avg_c2 if x > avg_c else x)

print(df)

# Fit A ~ B + C on the cleaned data
model_ols = smf.ols("A ~ B + C", data=df).fit()
print(model_ols.summary())

df[['B', 'C']].plot()
plt.show()

# Predict A over a range of B for two fixed values of C
df2 = pd.DataFrame()
df2['B'] = np.linspace(10, 50, 10)
df2['C'] = 30

df3 = pd.DataFrame()
df3['B'] = np.linspace(10, 50, 10)
df3['C'] = 100

predB = model_ols.predict(df2)
predC = model_ols.predict(df3)
plt.plot(df2['B'], predB, label='predict B C=30')
plt.plot(df3['B'], predC, label='predict B C=100')
plt.legend()
plt.show()

print("A change in the probability of C affects the probability of B")

intercept = model_ols.params.loc['Intercept']
B_slope = model_ols.params.loc['B']
C_slope = model_ols.params.loc['C']
# Intercept    11.874252
# B             0.760859
# C            -0.060257

print("Intercept {}\n B slope {}\n C slope {}\n".format(intercept, B_slope, C_slope))

# lower_conf, upper_conf = np.exp(model_ols.conf_int())
# print(lower_conf, upper_conf)
# print((1 - (lower_conf / upper_conf)) * 100)

# Standard errors from the covariance matrix of the estimates
model_cov = model_ols.cov_params()
std_errorB = np.sqrt(model_cov.loc['B', 'B'])
std_errorC = np.sqrt(model_cov.loc['C', 'C'])
print('Standard Error: ', round(std_errorB, 4), round(std_errorC, 4))

# Check which coefficient is more statistically significant (z = coef / SE)
print("B z value {} C z value {}".format(B_slope / std_errorB, C_slope / std_errorC))
print("B feature is more statistically significant than C")

Output:

A change in the probability of C affects the probability of B
Intercept 11.874251554067563
B slope 0.7608594144571961
C slope -0.060256845997223814

Standard Error:  0.4519 0.0793
B z value 1.683510336937001 C z value -0.7601036314930376
B feature is more statistically significant than C

z > 2 is statistically significant
Golden Lion