I am trying to predict y values based on X values. I have an Excel file that records how many siblings and spouses each person has. The file also contains a survival outcome, which is y (1 = Survived, 0 = Died).
The code snippet below shows how I load the data and build X and y:
import numpy as np
import pandas as pd

dataSet = pd.read_excel("TitanicData.xlsx", sheet_name="TitanicData")
dataSet.head()
dataSet.columns
SibSp = dataSet.iloc[:, 6]
Parch = dataSet.iloc[:, 7]
Stack = np.column_stack((SibSp, Parch, SibSp + Parch))
Family = pd.DataFrame(Stack, columns=['SibSp', 'Parch', 'Family'])
X = Family.iloc[:, 2]
y = dataSet.iloc[:, 1]
This gives me the values I expect: y is a column of 1s and 0s indicating whether the person survived or died, and X holds the sum of the SibSp and Parch columns.
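To check what these selections actually are, here is a small sketch with made-up numbers (the values are hypothetical stand-ins, not taken from TitanicData.xlsx). It suggests that selecting a single column with an integer index gives a 1-D Series, while selecting with a list keeps a 2-D DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the SibSp and Parch columns of the spreadsheet.
SibSp = pd.Series([1, 0, 3, 0])
Parch = pd.Series([0, 2, 1, 0])

Stack = np.column_stack((SibSp, Parch, SibSp + Parch))
Family = pd.DataFrame(Stack, columns=['SibSp', 'Parch', 'Family'])

X_series = Family.iloc[:, 2]    # integer index -> pandas Series, 1-D
X_frame = Family.iloc[:, [2]]   # list index -> pandas DataFrame, 2-D

print(type(X_series).__name__, X_series.shape)  # Series (4,)
print(type(X_frame).__name__, X_frame.shape)    # DataFrame (4, 1)
```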
I then split the data into training and testing sets like so (updated to show where X_train and X_test come from):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
However, when I then try to use sklearn.linear_model.LinearRegression, I get an error:
from sklearn.linear_model import LinearRegression

classifier = LinearRegression()
classifier.fit(X_train, y_train)
classifier.predict(X_test)
ValueError: Expected 2D array, got 1D array instead: array=[ 1 2 0 1 0 0 0 0 4 ...] Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I looked at similar questions on SO, but in my case the line throwing the exception is
classifier.fit(X_train, y_train)
How can I fit my training values into my classifier?
Update:
print(X_train.shape, y_train.values.reshape(-1,1).shape)
Gives me (534,) (534, 1)
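Note that the check above reshapes y_train, while the ValueError concerns the first argument to fit, and X_train is still 1-D with shape (534,). Below is a minimal sketch with synthetic data (the arrays are made up, and whether this is the right fix for the real spreadsheet is an assumption) that passes a 2-D X instead, as the error message suggests:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature (family size) and a 0/1 outcome.
rng = np.random.RandomState(101)
X = rng.randint(0, 5, size=20)   # shape (20,), 1-D like Family.iloc[:, 2]
y = rng.randint(0, 2, size=20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)

# reshape(-1, 1) turns shape (n,) into (n, 1): n samples, one feature each.
X_train_2d = X_train.reshape(-1, 1)
X_test_2d = X_test.reshape(-1, 1)

classifier = LinearRegression()
classifier.fit(X_train_2d, y_train)
predictions = classifier.predict(X_test_2d)
print(X_train_2d.shape, predictions.shape)  # (12, 1) (8,)
```

With the pandas objects from the question, the equivalent would be X_train.values.reshape(-1, 1), or selecting X as Family[['Family']] so that it is 2-D from the start.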
Update to show the full traceback:
File "<ipython-input-56-2da0ffaf5447>", line 1, in <module>
train()
File "C:/Users/user/Desktop/dantitanic/AnotherTest.py", line 41, in train
classifier.fit(X_train, y_train)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 458, in fit
y_numeric=True, multi_output=True)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
estimator=estimator)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 552, in check_array
"if it contains a single sample.".format(array))