Why does scikit-learn demand different data shapes for different regressors?

Question

I always find myself reshaping my data when I'm working with sklearn, and it's irritating and makes my code ugly. Why can't the library be made to work with a variety of data shapes, interpreting appropriately? For example, to work with a linear regressor I need to do

import numpy as np
from sklearn.linear_model import LinearRegression
x = np.random.rand(10).reshape(-1,1)
y = np.random.rand(10).reshape(-1,1)
regr = LinearRegression()
regr.fit(x,y)

but if I want to use a support vector regressor, then I don't reshape the dependent variable:

import numpy as np
from sklearn.svm import SVR
x = np.random.rand(10).reshape(-1,1)
y = np.random.rand(10)
regr = SVR()
regr.fit(x,y)

I presume there is some reason why the library is designed in this way; can anyone enlighten me?

I don't get any errors when using `x = np.random.rand(10).reshape(-1,1), y = np.random.rand(10)` with either of your specified estimators. Both `SVR` and `LinearRegression` can take y with shape (n,) or (n,1). — Vivek Kumar, Feb 01 '17 at 06:19
So you can run the above with `x=np.random.rand(10)` and `y=np.random.rand(10)`? I get a `ValueError` when I try to do that. What version of `scikit-learn` are you using? — Peter Wills, Feb 01 '17 at 17:27
No. Not `X`. `X` must always be a 2-d array of shape `[n_samples, n_features]`. I was talking about `y` (which is the only different code in your snippet above). `y` can be a column vector `[n_samples,1]` or simply `[n_samples,]`. — Vivek Kumar, Feb 02 '17 at 05:35
I get a `DataConversionWarning` when I use `SVR` with a column vector where `y.shape = (n_samples,1)`. As for `X`, I'm still unclear on why `sklearn` doesn't automatically understand that if I pass it something of the shape `(n,)`, then `n_samples=n` and `n_features=1`. — Peter Wills, Feb 04 '17 at 05:06
Yes, I do get a warning for y. For more clarity, I have added an answer. Hope it helps — Vivek Kumar, Feb 06 '17 at 09:04

Vivek Kumar · Accepted Answer · 2017-02-06T09:33:44.310

When you do y = np.random.rand(10), y is a one-dimensional array of shape (10,). It doesn't matter whether you think of it as a row vector or a column vector; it is just a vector with only one dimension. Take a look at this answer and this too to understand the philosophy behind it.

It's part of the "numpy philosophy", and sklearn depends on numpy.
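As a small illustration of that point (a sketch using only numpy; the variable names col and row are just for this example), a 1-D array has a single axis, so transposing it changes nothing, and you only get a "column" or a "row" once you reshape it into two dimensions:

```python
import numpy as np

y = np.random.rand(10)

print(y.shape)    # (10,)
print(y.ndim)     # 1
print(y.T.shape)  # (10,) - transposing a 1-D array is a no-op

# Only an explicit reshape creates a second axis.
col = y.reshape(-1, 1)  # column: 10 rows, 1 column
row = y.reshape(1, -1)  # row: 1 row, 10 columns

print(col.shape)  # (10, 1)
print(row.shape)  # (1, 10)
```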

As for your comment:

why sklearn doesn't automatically understand that if I pass it something of the shape (n,) that n_samples=n and n_features=1

sklearn cannot infer from the X data alone whether you mean n_samples=n and n_features=1 or the other way around (n_samples=1 and n_features=n). It could be inferred when y is also passed, since y makes n_samples clear.
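To make the ambiguity concrete, here is a rough sketch in plain numpy (the names X_as_samples and X_as_features are made up for this example, and this is not how sklearn validates input internally). The same 1-D array can be read either way, and only comparing against the length of y tells the two readings apart:

```python
import numpy as np

x = np.arange(5.0)  # shape (5,): five numbers, no second axis
y = np.arange(5.0)  # five targets

# Reading A: 5 samples with 1 feature each.
X_as_samples = x.reshape(-1, 1)   # shape (5, 1)

# Reading B: 1 sample with 5 features.
X_as_features = x.reshape(1, -1)  # shape (1, 5)

# Both are valid 2-D inputs; X alone cannot say which one was meant.
# With y available, only one reading has a matching number of samples.
a_matches = X_as_samples.shape[0] == y.shape[0]
b_matches = X_as_features.shape[0] == y.shape[0]

print(a_matches)  # True
print(b_matches)  # False
```

Note that y is not always there to help: a call such as predict receives only X, so the guess would have to be made without it.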

But that means changing all the code which relies on this type of semantics and that may break many things, because sklearn depends on numpy operations heavily.

You may also want to check the following links where similar issues are discussed.
