"ValueError: could not convert string to float" error in scikit-learn

Question

I'm running the following script:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
dataset = pd.read_csv('data/50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
onehotencoder = OneHotEncoder(categorical_features=3, 
handle_unknown='ignore')
onehotencoder.fit(X)

The head of the data looks like this (image: data)

And I've got this:

ValueError: could not convert string to float: 'New York'

I read the answers to similar questions and then opened the scikit-learn documentation, but as you can see there, the scikit-learn authors don't have issues with spaces in strings.

I know that I can use LabelEncoder from sklearn.preprocessing and then use OHE, and that works well, but in that case

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)

message occurs.

You can use the full CSV file or

[[165349.2, 136897.8, 471784.1, 'New York', 192261.83],
[162597.7, 151377.59, 443898.53, 'California', 191792.06],
[153441.51, 101145.55, 407934.54, 'Florida', 191050.39],
[144372.41, 118671.85, 383199.62, 'New York', 182901.99],
[142107.34, 91391.77, 366168.42, 'Florida', 166187.94]]

the first 5 lines to test this code.

Try dataset.info() to check the types of data that you have in your dataframe. – Jorge, Nov 26 '18 at 00:20
I've added the first 5 lines and a link to pastebin with the full content of the file – Aziz Temirkhanov, Nov 26 '18 at 00:29
The 'State' column is full of 50 non-null objects. Now I see the problem, but I still have no idea how to fix it without using `LabelEncoder` – Aziz Temirkhanov, Nov 26 '18 at 00:31
What would you expect 'New York' to be as a floating point number? *Why* would you think it has anything to do with a space in the string? – Jared Smith, Nov 26 '18 at 00:33
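
Jorge's suggestion can be tried on the five sample rows without the CSV file. This is only a sketch: apart from 'State', which is named in the comments above, the column names are hypothetical placeholders, since the question does not show the real headers.

```python
import pandas as pd

# The five sample rows from the question.
rows = [
    [165349.2, 136897.8, 471784.1, 'New York', 192261.83],
    [162597.7, 151377.59, 443898.53, 'California', 191792.06],
    [153441.51, 101145.55, 407934.54, 'Florida', 191050.39],
    [144372.41, 118671.85, 383199.62, 'New York', 182901.99],
    [142107.34, 91391.77, 366168.42, 'Florida', 166187.94],
]

# Hypothetical column names; only 'State' is taken from the thread.
columns = ['col_0', 'col_1', 'col_2', 'State', 'target']
dataset = pd.DataFrame(rows, columns=columns)

# Prints one line per column with its dtype and non-null count.
dataset.info()

# Columns that are not numeric are the ones that cannot be cast to float.
non_numeric = dataset.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

Any column listed in non_numeric has to be encoded before it reaches an estimator that expects floats.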

DYZ · Accepted Answer · 2018-11-26T01:50:53.580

4

It is categorical_features=3 that hurts you. You cannot use categorical_features with string data. Remove this option, and luck will be with you. Also, you probably need fit_transform, not fit as such.

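onehotencoder = OneHotEncoder(handle_unknown='ignore')
# Encode only the 'State' column; X[:, [3]] keeps it 2D, as the encoder requires
transformed = onehotencoder.fit_transform(X[:, [3]]).toarray()
# Keep columns 0-2, insert the one-hot columns, then append whatever follows column 3
X1 = np.concatenate([X[:, :3], transformed, X[:, 4:]], axis=1)
# For the five sample rows from the question:
#array([[165349.2, 136897.8, 471784.1, 0.0, 0.0, 1.0, 192261.83],
#       [162597.7, 151377.59, 443898.53, 1.0, 0.0, 0.0, 191792.06],
#       [153441.51, 101145.55, 407934.54, 0.0, 1.0, 0.0, 191050.39],
#       [144372.41, 118671.85, 383199.62, 0.0, 0.0, 1.0, 182901.99],
#       [142107.34, 91391.77, 366168.42, 0.0, 1.0, 0.0, 166187.94]], dtype=object)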

edited Nov 26 '18 at 01:50

answered Nov 26 '18 at 00:56

In that case the whole dataset is transformed into categorical data, not only column 3 – Aziz Temirkhanov Nov 26 '18 at 00:57
You can choose which columns to transform. – DYZ Nov 26 '18 at 00:58
I ran this code: `onehotencoder = OneHotEncoder(handle_unknown='ignore') onehotencoder.fit(X[:, 3])` and got this error: `ValueError: Expected 2D array, got 1D array instead:` – Aziz Temirkhanov Nov 26 '18 at 01:08
Because you pass a 1D array instead of a 2D array. You ought to pass `X[:, [3]]` or `X[:, 3].reshape(-1, 1)`. – DYZ Nov 26 '18 at 01:24
OK, I did it like you said. Now if I apply this `X = onehotencoder.transform(X[:, [3]]).toarray()` I'm losing my first 3 columns. If I apply this `X = onehotencoder.transform(X[:, 3]).toarray()` the same error occurs – Aziz Temirkhanov Nov 26 '18 at 01:30
You have to combine the transformed columns with the original columns. I am afraid that your understanding of how Python (and Numpy) works is still insufficient for carrying out complex tasks, and strongly suggest that you read a good numpy tutorial. – DYZ Nov 26 '18 at 01:35
Sorry, I've been struggling with this problem for 3 hours and it's 4:35 a.m., so I'm a bit tired. Could you please paste the whole working code that transforms my 4 columns into 6 columns correctly? – Aziz Temirkhanov Nov 26 '18 at 01:37
Following your advice, I checked how np.concatenate works and got the same result, but your code looks cleaner. Thanks for carrying me – Aziz Temirkhanov Nov 26 '18 at 01:57
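
The "Expected 2D array, got 1D array" error discussed in the comments comes down to array shapes. A minimal NumPy-only sketch, assuming X holds the five sample rows from the question, shows how the different ways of selecting column 3 differ:

```python
import numpy as np

# The five sample rows from the question, kept as an object array
# because the columns mix floats and strings.
X = np.array([
    [165349.2, 136897.8, 471784.1, 'New York', 192261.83],
    [162597.7, 151377.59, 443898.53, 'California', 191792.06],
    [153441.51, 101145.55, 407934.54, 'Florida', 191050.39],
    [144372.41, 118671.85, 383199.62, 'New York', 182901.99],
    [142107.34, 91391.77, 366168.42, 'Florida', 166187.94],
], dtype=object)

col_1d = X[:, 3]                     # integer index: 1D, shape (5,)
col_2d = X[:, [3]]                   # list index: 2D, shape (5, 1)
col_reshaped = X[:, 3].reshape(-1, 1)  # same 2D shape via reshape
one_sample = X[:, 3].reshape(1, -1)    # shape (1, 5): one row, five features

print(col_1d.shape, col_2d.shape, col_reshaped.shape, one_sample.shape)
```

scikit-learn transformers expect input of shape (n_samples, n_features), so a single column has to be passed as (5, 1) here. The (1, 5) form is also 2D, but it would be read as one sample with five features.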

score 0 · Answer 2 · answered Dec 17 '18 at 02:59

Try this:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

columntransformer = make_column_transformer(
    (OneHotEncoder(categories='auto'), [3]),
    remainder='passthrough')

X = columntransformer.fit_transform(X)
X = X.astype(float)
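
For anyone who wants to run this without the CSV file, here is a self-contained sketch on the five sample rows from the question, with the last column dropped to mimic `dataset.iloc[:, :-1]`. The `sparse_threshold=0` argument is an addition of this sketch, not part of the answer; it is there only to make sure a dense array comes back.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# The five sample rows from the question.
data = np.array([
    [165349.2, 136897.8, 471784.1, 'New York', 192261.83],
    [162597.7, 151377.59, 443898.53, 'California', 191792.06],
    [153441.51, 101145.55, 407934.54, 'Florida', 191050.39],
    [144372.41, 118671.85, 383199.62, 'New York', 182901.99],
    [142107.34, 91391.77, 366168.42, 'Florida', 166187.94],
], dtype=object)

X = data[:, :-1]  # features: everything except the last column

columntransformer = make_column_transformer(
    (OneHotEncoder(categories='auto'), [3]),  # one-hot encode column 3 only
    remainder='passthrough',                  # keep the other columns unchanged
    sparse_threshold=0)                       # assumption of this sketch: force dense output

X_encoded = columntransformer.fit_transform(X).astype(float)
print(X_encoded.shape)
print(X_encoded[0])
```

Note that the encoded columns come first in the result and the passed-through columns follow, so the column order differs from the np.concatenate approach in the accepted answer.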
