sklearn-LinearRegression: could not convert string to float: '--'

Question

I am trying to fit a LinearRegression from sklearn and I am getting a 'could not convert string to float' error. All columns of the dataframe are float and the output y is also float. I have looked at other posts, and the suggestion there is to convert to float, which I have done.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 789 entries, 158 to 684
Data columns (total 8 columns):
f1     789 non-null float64
f2     789 non-null float64
f3     789 non-null float64
f4     789 non-null float64
f5     789 non-null float64
f6     789 non-null float64
OFF    789 non-null uint8
ON     789 non-null uint8
dtypes: float64(6), uint8(2)
memory usage: 44.7 KB

type(y_train)
pandas.core.series.Series
type(y_train[0])
float

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,random_state=0)
X_train.head()
from sklearn.linear_model import LinearRegression
linreg = LinearRegression().fit(X_train, y_train)

The error I get is:

ValueError                                Traceback (most recent call last)
<ipython-input-282-c019320f8214> in <module>()
      6 X_train.head()
      7 from sklearn.linear_model import LinearRegression
----> 8 linreg = LinearRegression().fit(X_train, y_train)
    510         n_jobs_ = self.n_jobs
    511         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 512                          y_numeric=True, multi_output=True)
    513 
    514         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

    527         _assert_all_finite(y)
    528     if y_numeric and y.dtype.kind == 'O':
--> 529         y = y.astype(np.float64)
    530 
    531     check_consistent_length(X, y)

ValueError: could not convert string to float: '--'

Please help.

cs95 · Accepted Answer · 2017-09-07T10:00:16.557

10

A quick solution is to use pd.to_numeric to convert whatever strings your data contains to numeric values. Anything that cannot be converted is coerced to NaN.

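import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Coerce anything non-numeric (such as '--') to NaN
X = X.apply(pd.to_numeric, errors='coerce')
Y = Y.apply(pd.to_numeric, errors='coerce')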
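Before filling anything, it may help to see which entries actually failed to convert. The following is only a sketch: the sample Series and the names coerced, bad_mask and bad_values are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical target column: numbers mixed with a '--' placeholder
Y = pd.Series([1.5, '--', 3.0, '2.25'], name='target')

# Non-convertible entries become NaN instead of raising
coerced = pd.to_numeric(Y, errors='coerce')

# Entries that were present originally but turned into NaN after coercion
bad_mask = coerced.isna() & Y.notna()
bad_values = Y[bad_mask]

print(bad_values.tolist())
```

Inspecting bad_values shows whether the problem is a single placeholder such as '--' or something more varied, which should inform the choice of fill value.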

Furthermore, you can choose to fill those values with some default:

X.fillna(0, inplace=True)
Y.fillna(0, inplace=True)

Replace the fill value with whatever's relevant to your problem. I don't recommend dropping these rows, because you might end up dropping different rows from X and Y, causing a data-label mismatch.
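If filling is not appropriate for your data and you do need to drop rows, one way to avoid the mismatch described above is to build a single mask and apply it to both X and Y. This is a sketch of an alternative, not what the answer recommends; the sample data and the names keep, X_clean and Y_clean are assumptions, and it assumes X and Y share the same index.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the NaNs sit in different rows of X and Y
X = pd.DataFrame({'f1': [1.0, np.nan, 3.0, 4.0],
                  'f2': [0.5, 1.5, 2.5, 3.5]})
Y = pd.Series([10.0, 20.0, np.nan, 40.0])

# Keep a row only if it is complete in both the features and the target
keep = X.notna().all(axis=1) & Y.notna()

# Apply the same mask to both, so features and labels stay aligned
X_clean, Y_clean = X[keep], Y[keep]
```

Calling dropna() on X and Y separately would remove row 1 from X and row 2 from Y, which is exactly the mismatch to avoid.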

Finally, split and fit your regressor:

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
clf = LinearRegression().fit(X_train, y_train)

edited Sep 07 '17 at 10:00

answered Sep 07 '17 at 09:41

cs95

But if they become `NaN`s, LinearRegression.fit() will still throw an error. – Vivek Kumar Sep 07 '17 at 09:49
@VivekKumar I don't know what OP wants to do with those NaNs... maybe drop them? Fill them? I'll edit on further clarification. – cs95 Sep 07 '17 at 09:50
Aah ok. So this will verify that the data OP has is actually good or not. Thanks – Vivek Kumar Sep 07 '17 at 09:51
@ColdSpeed Thanks! That helped! – Tinniam V. Ganesh Sep 07 '17 at 10:42

score 3 · Answer 2 · answered Aug 04 '19 at 18:44

I think it's better to convert all the string columns to numeric values (for example, 0/1 indicator columns) using label encoding or one-hot encoding. After that, the linear regression will behave much better.
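As a rough illustration of the one-hot route, pandas' get_dummies can turn a string column into 0/1 indicator columns. The DataFrame and the column name state below are made up for the example; only the ON/OFF labels echo the question.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column
df = pd.DataFrame({'f1': [0.1, 0.2, 0.3],
                   'state': ['ON', 'OFF', 'ON']})

# One-hot encode the string column; dtype=float yields 0.0/1.0 indicators
encoded = pd.get_dummies(df, columns=['state'], dtype=float)

print(encoded.columns.tolist())
```

Note that this only makes sense for genuinely categorical columns. A numeric column polluted by a placeholder such as '--' is better handled by coercion.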

answered Aug 04 '19 at 18:44

Sagar Narula

score 0 · Answer 3 · edited May 10 '22 at 21:34

This happens because one of your columns contains string values. I had the same problem: I had been asked to drop a column, but I skipped that step because I thought the column had already been deleted.

However, after running this code:

from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

I got this error:

could not convert string to float: 'product_mng'

The reason is that X_train still had the string column, which I thought was deleted. In conclusion, check again that none of your columns contain strings. If one does, delete it with DataFrame.drop, or label encode (or one-hot encode) it.
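One way to run that check is to list every column that is not numeric before fitting. This is a minimal sketch; the sample frame, the column name role and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical training frame that still carries a string column
X_train = pd.DataFrame({'f1': [1.0, 2.0],
                        'role': ['product_mng', 'sales']})

# Every column whose dtype is not numeric (this also flags bool and datetime)
non_numeric = X_train.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)

# Drop the offending columns (or encode them instead, if they carry signal)
X_numeric = X_train.drop(columns=non_numeric)
```

An empty non_numeric list means the frame holds only numeric dtypes. An object column that mixes numbers with a stray string will show up here too.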
