
I am trying to predict y values based on X values. I have an Excel file that records how many siblings and spouses (SibSp) and parents and children (Parch) each person has. The file also contains a survival outcome, which is y (1 = Survived, 0 = Died).

The code snippet below shows how I do this:

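import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dataSet = pd.read_excel("TitanicData.xlsx", sheet_name="TitanicData")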
dataSet.head()
dataSet.columns

SibSp  = dataSet.iloc[:, 6]
Parch  = dataSet.iloc[:, 7]

Stack  = np.column_stack((SibSp, Parch, SibSp + Parch))
Family = pd.DataFrame(Stack, columns=['SibSp', 'Parch', 'Family'])

X      = Family.iloc[:, 2]
y      = dataSet.iloc[:, 1]

This gives me the values I expect: y is a Series of 1s and 0s indicating whether the person survived, and X holds the sum of the SibSp and Parch columns.

I then split the data into training and testing sets like so (updated to show where X_train and X_test come from):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

However, when I then try to use sklearn.linear_model.LinearRegression, I get an error:

classifier = LinearRegression()

classifier.fit(X_train, y_train)
classifier.predict(X_test)

ValueError: Expected 2D array, got 1D array instead: array=[ 1 2 0 1 0 0 0 0 4 ...] Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I looked at similar questions on SO without success. The line throwing this exception is:

classifier.fit(X_train, y_train)

How can I fit my training values into my classifier?

Update:

print(X_train.shape, y_train.values.reshape(-1,1).shape)

Gives me (534,) (534, 1)
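
A minimal sketch of why the shape comes out as (534,): selecting one column by integer position returns a 1-D Series, while selecting with a list of positions keeps a 2-D DataFrame. The `family` frame here is a hypothetical three-row stand-in, not the real Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the Family frame built in the question.
family = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 2, 1]})
family["Family"] = family["SibSp"] + family["Parch"]

as_series = family.iloc[:, 2]    # single integer position -> 1-D Series
as_frame = family.iloc[:, [2]]   # list of positions -> 2-D DataFrame

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)
```

Under this assumption, using `Family.iloc[:, [2]]` for X would probably avoid the need to reshape at all, since scikit-learn accepts a single-column DataFrame as a 2-D input.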

Update to show full debug trace

  File "<ipython-input-56-2da0ffaf5447>", line 1, in <module>
    train()

  File "C:/Users/user/Desktop/dantitanic/AnotherTest.py", line 41, in train
    classifier.fit(X_train, y_train)

  File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 458, in fit
    y_numeric=True, multi_output=True)

  File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
    estimator=estimator)

  File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 552, in check_array
    "if it contains a single sample.".format(array))
Jaquarh
  • Read your error. It clearly states you have a different dimension data which you are feeding inside `fit`. Hint- Pass `1-D` as even error states how to. – meW Jan 08 '19 at 10:47
  • I tried `X.reshape(1, -1)` like it suggests giving me `'Series' object has no attribute 'reshape'`, I understand that it wants a 2D `[[], []]` array but I don't know how to replicate that error in code @meW – Jaquarh Jan 08 '19 at 10:48
  • `X.values.reshape(-1,1)` – meW Jan 08 '19 at 10:49
  • Gives me, `Found input variables with inconsistent numbers of samples: [1, 534]` :/ @meW – Jaquarh Jan 08 '19 at 10:50
  • in which line you are getting error? predict or fit? – Nihal Jan 08 '19 at 10:51
  • `classifier.fit(X_train, y_train)` according to my debug trace. I will update the question to provide the full trace. @Nihal – Jaquarh Jan 08 '19 at 10:51
  • @Jaquarh for a 1-D data it should be any number of rows but 1 col. See for yourself you stated `[1,534]` which means just the opposite. You get my point? Check the shape before passing into `fit` – meW Jan 08 '19 at 10:53
  • I'm new to Python, PHP is my first - and only language. From what I can gather, it works as `[rows, cols]`. I have been following a [scikit-learn](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) tutorial to replicate this but with my own data. I am honestly trying to help you out more but I don't understand what `[1,534]` represents or even what the `shape` has to do with a 2D array since `(n, 2)` would be `[[0,0], [0,1]]` and so on, where as I assume I need `[[0], [1], [3]]` and so on @meW – Jaquarh Jan 08 '19 at 10:57
  • But replicating that as code, since Python isn't my playground, I don't understand how to deal with this error or what it expects me to do - the tutorial doesn't come across it. I could provide my full environment if it helps rather than the minimal but complete example? – Jaquarh Jan 08 '19 at 10:58
  • try `classifier.fit(X_train, y_train.values.reshape(-1,1))` BTW do show where have to initialized `X_train` and `y_train` – meW Jan 08 '19 at 11:00
  • Same error, `File "C:/Users/user/Desktop/dantitanic/AnotherTest.py", line 41, in train classifier.fit(X_train, y_train.values.reshape(-1,1))` with it being a 1-D array :/ I appreciate your help! Would providing my full environment help? ie: With my data set? @meW – Jaquarh Jan 08 '19 at 11:01
  • share the output of `print(X_train.shape, y_train.values.reshape(-1,1).shape)` – meW Jan 08 '19 at 11:03
  • I updated the question with where `X_train` derives from and the output of `print()` which is `(534,) (534, 1)` @meW – Jaquarh Jan 08 '19 at 11:05
  • great that's why I asked earlier to show me how are you initializing `X_train` and `y_train`. Go with `classifier.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1))` – meW Jan 08 '19 at 11:06
  • I think you A) Deserve a gold star for putting up with my Python noobness and B) Deserve a hug cos that fixed my issue! If you want to provide an answer to this I'll mark it as correct when I can - Thanks so much for your help and patience – Jaquarh Jan 08 '19 at 11:07
  • @Jaquarh keep learning. Since it's just a common query so no need for me to put it as an answer. – meW Jan 08 '19 at 11:10
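
The two reshape calls discussed in the comments differ as follows. This is a small sketch on a made-up five-element array, not the asker's data: `reshape(1, -1)` produces one sample with many features, which is likely what led to the "inconsistent numbers of samples: [1, 534]" message, while `reshape(-1, 1)` produces many samples with one feature.

```python
import numpy as np

# Made-up values standing in for a single feature column.
x = np.array([1, 2, 0, 1, 0])

row = x.reshape(1, -1)     # one sample, five features
column = x.reshape(-1, 1)  # five samples, one feature

print(row.shape)     # (1, 5)
print(column.shape)  # (5, 1)
```

scikit-learn reads the first axis as the number of samples, so the column form is the one whose sample count matches a target of the same length.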

1 Answer

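X_train and X_test are pandas Series, which have no reshape method, so convert them to NumPy arrays and reshape each into a single column (one row per sample) before fitting:

X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)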
QuarKUS7
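
For reference, here is an end-to-end sketch of the fix that worked in the comments. It uses randomly generated stand-in data rather than TitanicData.xlsx, so the names `X_train_2d`, `X_test_2d`, and `model`, as well as the values themselves, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 50 rows instead of the real Excel file.
rng = np.random.default_rng(101)
X = pd.Series(rng.integers(0, 6, size=50), name="Family")    # 1-D feature
y = pd.Series(rng.integers(0, 2, size=50), name="Survived")  # 1-D target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101
)

# Convert each Series to a NumPy array with one column per feature.
X_train_2d = X_train.to_numpy().reshape(-1, 1)
X_test_2d = X_test.to_numpy().reshape(-1, 1)

model = LinearRegression()
model.fit(X_train_2d, y_train)  # y may stay 1-D
predictions = model.predict(X_test_2d)

print(predictions.shape)  # (20,)
```

Only the feature array has to be 2-D here; reshaping y as well, as in the comments, also works but should not be required.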