I'm trying to train a very simple linear regression model.

My code is:

from scipy import stats

# Each row is one observation with three features.
xs = [[   0,    1,  153],
      [   1,    2,    0],
      [   2,    3,  125],
      [   3,    1,   93],
      [   2,   24, 5851],
      [   3,    1,  524],
      [   4,    1,    0],
      [   2,    3,    0],
      [   2,    1,    0],
      [   5,    1,    0]]

ys = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1]

# This is the call that raises the error shown below.
slope, intercept, r_value, p_value, std_err = stats.linregress(xs, ys)

I'm getting the following error:

File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/stats/stats.py", line 3100, in linregress
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/lib/function_base.py", line 1747, in cov
X = concatenate((X, y), axis)
ValueError: all the input array dimensions except for the concatenation 
axis must match exactly

What's wrong with my input? I've tried changing the structure of ys in several ways but nothing works.

jbrown
  • What do you want to achieve? – Abdul Fatir Jun 23 '16 at 08:14
  • I want to get rid of the error so I can train this over my full training dataset. It seems to be complaining that the two arrays don't have the same dimensions, but they are both 10 elements long. xs represents page views, duration on pages and page section, ys correspond to known genders. I'm trying to create a model to predict gender based on web site viewing behaviour. – jbrown Jun 23 '16 at 08:16
  • http://stackoverflow.com/questions/18404077/concatenating-arrays-in-python-like-matlab-without-knowing-the-size-of-the-outpu – Destrif Jun 23 '16 at 08:17
  • @Destrif how's that relevant? – jbrown Jun 23 '16 at 08:17
  • For example, use hstack if you cannot get rid of the concatenate errors, or maybe show us the values of X, y and axis? – Destrif Jun 23 '16 at 08:19
  • Linear Regression is between two variables say `x` and `y`. What you have as `x` is a list of 3 variables vs y. This won't work. – Abdul Fatir Jun 23 '16 at 08:20
  • @Destrif that's not my code. That's an error from scipy. I need to know how to pass the parameters into the `linregress` function to make it happy. I've never worked with it or numpy before... – jbrown Jun 23 '16 at 08:21
  • @AbdulFatir This is multi-variable linear regression. Simple linear regression is between 2 variables. This is between a vector and a dependent variable. See https://en.wikipedia.org/wiki/Linear_regression#Simple_and_multiple_regression – jbrown Jun 23 '16 at 08:22

1 Answer

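You're looking for multivariable (multiple) regression. AFAIK stats.linregress does not have that functionality: it fits a single x variable against a single y variable, so it cannot take a 10x3 array of features.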

You might want to try sklearn.linear_model.LinearRegression. Check this answer.
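As a minimal sketch of that suggestion, assuming scikit-learn and numpy are installed, the data from the question could be fitted like this. The variable name `model` is just illustrative, and the printed values are not shown because they depend on the fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data copied from the question: 10 observations, 3 features each.
xs = np.array([[0, 1, 153],
               [1, 2, 0],
               [2, 3, 125],
               [3, 1, 93],
               [2, 24, 5851],
               [3, 1, 524],
               [4, 1, 0],
               [2, 3, 0],
               [2, 1, 0],
               [5, 1, 0]])
ys = np.array([1, 1, 1, 1, 1, 0, 1, 1, 0, 1])

# Ordinary least squares with an intercept (fit_intercept defaults to True).
model = LinearRegression()
model.fit(xs, ys)

print(model.coef_)             # one coefficient per feature column
print(model.intercept_)        # fitted intercept
print(model.predict(xs[:2]))   # predictions for the first two rows
```

Note that ys here is a 0/1 label, so a classifier may turn out to be a better fit for the underlying problem than linear regression; the sketch only shows how to pass a multi-column xs.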

Abdul Fatir
  • Oh, maybe I misunderstood the docs for scipy. I thought it supported it. Let me give scikit a try... – jbrown Jun 23 '16 at 08:25
  • Or statsmodels' OLS – ev-br Jun 23 '16 at 11:52
  • statsmodels' OLS is the correct answer. scikit gives you much less of the statistics behind what is going on, and if you're doing linear regression in the first place, those statistics (p-values, t-tests, etc.) are probably why you are doing it, rather than the "machine learning" answer. – rawkintrevo Jan 24 '20 at 14:11