This is the code:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
print(train_data.head())
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)
train_x = train_data.drop(columns=['pHSWS25'],axis=1)
train_y = train_data['pHSWS25']
print train_x.head()
print train_y.head()
LinearRegression().fit(train_x,train_y)

When I run it I get:

   Section  Longitude  Latitude  ...  Alkalinity  pHSWS25    TCO2
0  06GA19960613      64.87     81.38  ...      2236.3  7.79776  2056.6
1  06GA19960613      64.87     81.38  ...      2234.4  7.78997  2068.4
2  06GA19960613      64.87     81.38  ...      2247.1  7.74140  2104.1
3  06GA19960613      64.87     81.38  ...      2254.1  7.71428  2120.5
4  06GA19960613      64.87     81.38  ...      2270.4  7.69494  2131.7

[5 rows x 18 columns]
('\nShape of training data :', (87099, 18))
('\nShape of testing data :', (171921, 18))
////////////////////////
        Section  Longitude  Latitude  ...  Phosphate  Alkalinity    TCO2
0  06GA19960613      64.87     81.38  ...   0.214634      2236.3  2056.6
1  06GA19960613      64.87     81.38  ...   0.253659      2234.4  2068.4
2  06GA19960613      64.87     81.38  ...   0.390244      2247.1  2104.1
3  06GA19960613      64.87     81.38  ...   0.536585      2254.1  2120.5
4  06GA19960613      64.87     81.38  ...   0.595122      2270.4  2131.7   
[5 rows x 17 columns]
0    7.79776
1    7.78997
2    7.74140
3    7.71428
4    7.69494
Name: pHSWS25, dtype: float64

The error:

Traceback (most recent call last):
  File "ocean_data.py", line 60, in <module>
    LinearRegression().fit(train_x,train_y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 458, in fit
    y_numeric=True, multi_output=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 756, in check_X_y
    estimator=estimator)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 567, in check_array
    array = array.astype(np.float64)
ValueError: invalid literal for float(): 06GA19960613

Could anyone help to solve this issue?

luis.parravicini
Siham MB
  • You have a string column, which sklearn cannot use directly; you need to encode it first: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder – EdChum Nov 19 '19 at 10:02
  • How do I encode it? I couldn't use the link. – Siham MB Nov 19 '19 at 10:26

1 Answer

Linear regression only accepts numerical features. If you run train_data.dtypes, you will probably get back something like:

Section      object
Longitude    float64

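The Section column holds strings such as 06GA19960613, which cannot be cast to float. You have to convert this column to a numeric representation, or use a different regression type.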
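To find out which columns are affected before fitting, a quick check along these lines may help. This is only a sketch: `toy` is a hypothetical stand-in for train_data, built from two rows of the output shown in the question and cut down to three columns.

```python
import pandas as pd

# Hypothetical stand-in for train_data (three of the 18 columns)
toy = pd.DataFrame({
    'Section': ['06GA19960613', '06GA19960613'],
    'Longitude': [64.87, 64.87],
    'pHSWS25': [7.79776, 7.78997],
})

# Columns that are not numeric cannot be passed to LinearRegression as they are
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)  # ['Section']
```

Any column that shows up in this list needs to be encoded or dropped before calling fit.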

One way to convert the column is called encoding, for example one-hot encoding:

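from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' maps Section values not seen during fit to all zeros
enc = OneHotEncoder(handle_unknown='ignore')
# The encoder expects 2D input, hence the double brackets
enc.fit(train_data[['Section']])
# The result has one column per distinct Section, so it cannot be
# assigned back to the single 'Section' column
train_section = enc.transform(train_data[['Section']]).toarray()
test_section = enc.transform(test_data[['Section']]).toarray()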

This is just a starting point. If shape errors occur now, you will have to adjust the data format a little: the encoded values have to be combined with the remaining numeric columns before fitting.
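One way to avoid handling the shapes by hand is to let scikit-learn do the combining. The sketch below assumes scikit-learn 0.20 or later (where `ColumnTransformer` is available) and uses small made-up frames in place of train.csv and test.csv; it also assumes Section is the only non-numeric column, which may not hold for the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-ins for train.csv / test.csv, reduced to three columns
train_data = pd.DataFrame({
    'Section': ['A', 'A', 'B', 'B'],
    'Longitude': [64.87, 64.90, 70.10, 70.20],
    'pHSWS25': [7.80, 7.79, 7.74, 7.71],
})
test_data = pd.DataFrame({
    'Section': ['A', 'C'],  # 'C' never appears in the training data
    'Longitude': [64.88, 71.00],
    'pHSWS25': [7.79, 7.70],
})

train_x = train_data.drop(columns=['pHSWS25'])
train_y = train_data['pHSWS25']
test_x = test_data.drop(columns=['pHSWS25'])

# One-hot encode 'Section' and pass every other column through unchanged
preprocess = ColumnTransformer(
    [('section', OneHotEncoder(handle_unknown='ignore'), ['Section'])],
    remainder='passthrough',
)
model = Pipeline([('preprocess', preprocess), ('regress', LinearRegression())])

model.fit(train_x, train_y)
predictions = model.predict(test_x)
print(predictions.shape)  # one prediction per test row
```

Because the encoder lives inside the pipeline, the same encoding learned from the training data is applied to the test data automatically. If other columns also contain strings or missing values, they would need similar treatment before the fit succeeds.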

PV8