I see that similar questions have been asked, but it doesn't look like those were caused by the same problem. Here is my code that gives the error:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from io import StringIO
d = pd.read_csv("http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv")
data = d[['Survived','Pclass','Sex','Age','SibSp','Parch']]
#print(data.head(n=7))
y = data.Survived
X = data[['Pclass','Sex','Age','SibSp','Parch']]
k = 3
knn = KNeighborsClassifier(n_neighbors=k, weights='distance', metric='euclidean')
knn.fit(X, y)

So I tried to convert the Sex column to float like this:

data.Sex=data[['Sex']].astype(float)

But that just gives the exact same error. Why is it not able to convert the string to float?

Johnny
  • For starters, a non-numeric string such as 'male' cannot be converted to a float. Also, when sharing errors on SO, it's best to post the actual error message produced by the Python interpreter. – MedoAlmasry Mar 31 '23 at 02:53
  • You can convert that string-typed categorical data to numeric values with one-hot encoding or label encoding. – biyazelnut Mar 31 '23 at 02:56
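
The two encodings named in the comment above can be sketched roughly as follows. This is only an illustration on a small hand-made frame (`toy` and its values are made up, not taken from the Titanic file); `pd.get_dummies` does the one-hot step, and a plain dictionary passed to `map` stands in for label encoding.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Sex' column.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Label encoding: one integer per category (mapping chosen by hand here).
label_encoded = toy["Sex"].map({"male": 0, "female": 1})

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy["Sex"], prefix="Sex", dtype=int)

print(label_encoded.tolist())    # [0, 1, 1, 0]
print(one_hot.columns.tolist())  # ['Sex_female', 'Sex_male']
```

For a two-valued column like this one, a single 0/1 label column is usually enough; one-hot encoding matters more when there are three or more categories with no natural order.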

1 Answer

You can use replace or pd.factorize to turn the strings into integers:

data['Sex'] = data['Sex'].replace({'male': 0, 'female': 1})

# OR

data['Sex'] = pd.factorize(data['Sex'])[0]

Output:

>>> data
     Survived  Pclass  Sex   Age  SibSp  Parch
0           0       3    0  22.0      1      0
1           1       1    1  38.0      1      0
2           1       3    1  26.0      0      0
3           1       1    1  35.0      1      0
4           0       3    0  35.0      0      0
..        ...     ...  ...   ...    ...    ...
886         0       2    0  27.0      0      0
887         1       1    1  19.0      0      0
888         0       3    1   NaN      1      2
889         1       1    0  26.0      0      0
890         0       3    0  32.0      0      0

[891 rows x 6 columns]
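
To see why the `astype(float)` attempt in the question fails where the approaches above work, here is a small sketch on made-up data (the `toy` frame is hypothetical). `astype(float)` can only parse strings that look like numbers, so 'male' raises a `ValueError`, whereas `pd.factorize` returns integer codes in order of first appearance together with the unique labels. The `dropna()` step is my own assumption, not part of the answer: the output above still shows a NaN in Age, and with `metric='euclidean'` as in the question I would expect `knn.fit` to reject missing values, so those rows probably need to be dropped or imputed first.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

# astype(float) cannot parse a non-numeric string like 'male'.
try:
    toy["Sex"].astype(float)
    failed = False
except ValueError:
    failed = True

# factorize returns (integer codes, unique labels in order of appearance).
codes, uniques = pd.factorize(toy["Sex"])
toy["Sex"] = codes

# Assumption: drop rows with missing values before fitting a model.
clean = toy.dropna()

print(failed)         # True
print(list(uniques))  # ['male', 'female']
print(len(clean))     # 3
```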

Important note

To prevent a SettingWithCopyWarning when assigning to data, either load only the columns you need or take an explicit copy of the subset:

url = 'http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv'
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']
data = pd.read_csv(url, usecols=cols)

# OR

data = d[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']].copy()
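
As a rough illustration of the `.copy()` variant, the sketch below uses a tiny made-up frame in place of the real CSV (the `d` contents here are hypothetical). Because `data` is an independent copy, assigning to `data['Sex']` is unambiguous and leaves `d` untouched.

```python
import pandas as pd

# Hypothetical stand-in for the full frame `d` read from the CSV.
d = pd.DataFrame({
    "Survived": [0, 1],
    "Sex": ["male", "female"],
    "Fare": [7.25, 71.28],
})

# .copy() makes `data` its own frame rather than a slice of `d`.
data = d[["Survived", "Sex"]].copy()
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})

print(data["Sex"].tolist())  # [0, 1]
print(d["Sex"].tolist())     # ['male', 'female']
```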
Corralien