I want to one-hot encode the sex column in my DataFrame df. I reshaped the result using reshape(-1, 1), but I still get the error "Expected 2D array, got 1D array instead".

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('C:/Users/User/Downloads/suicide_rates.csv')

ohe = OneHotEncoder()
df['sex'] = ohe.fit_transform(df['sex']).toarray().reshape(-1, 1)

Traceback

ValueError: Expected 2D array, got 1D array instead: array=['male' 'male' 'female' ... 'male' 'female' 'female']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

melololo

1 Answer

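You are closing the parenthesis of fit_transform before reshaping, so reshape(-1, 1) is applied to the encoder's output while the encoder itself still receives the 1D column. Reshape the input before passing it to fit_transform.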
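A minimal sketch of the corrected call order, using a small stand-in DataFrame instead of the CSV from the question (the sample values and the variable names col_2d, encoded and encoded_alt are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the CSV in the question (assumed sample values).
df = pd.DataFrame({"sex": ["male", "male", "female"]})

ohe = OneHotEncoder()

# df["sex"] is a 1D Series: reshape it *before* it reaches fit_transform.
col_2d = df["sex"].to_numpy().reshape(-1, 1)
encoded = ohe.fit_transform(col_2d).toarray()

# Selecting with a list of column names returns a 2D DataFrame,
# so no reshape is needed in that case.
encoded_alt = ohe.fit_transform(df[["sex"]]).toarray()
```

Both calls should yield the same array, with one row per sample and one column per category.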

The other problem is that the one-hot encoder creates several columns (one per category), and you cannot assign those to a single column:

In [1]: from sklearn.preprocessing import OneHotEncoder

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"sex": list("mmmfff")})

In [4]: ohe = OneHotEncoder()

In [5]: ohe.fit_transform(df["sex"].to_numpy().reshape(-1, 1))
Out[5]: 
<6x2 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

In [6]: _.toarray()
Out[6]: 
array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])

You can see that we have 2 columns. If you are sure that the column has only 2 distinct values, you can use the drop parameter of OneHotEncoder: with drop="first" it drops the column for the first category, leaving a single column that you can assign to the dataframe:

In [11]: ohe_with_drop = OneHotEncoder(drop="first")

In [12]: ohe_with_drop.fit_transform(df["sex"].to_numpy().reshape(-1, 1)).toarray()
Out[12]: 
array([[1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.]])

In [13]: df["sex_ohe"] = ohe_with_drop.fit_transform(df["sex"].to_numpy().reshape(-1, 1)).toarray()

In [14]: df
Out[14]: 
  sex  sex_ohe
0   m      1.0
1   m      1.0
2   m      1.0
3   f      0.0
4   f      0.0
5   f      0.0

See the scikit-learn documentation for more about one hot encoding.

As an alternative, you can use pandas.get_dummies:

In [18]: pd.get_dummies(df["sex"])
Out[18]: 
   f  m
0  0  1
1  0  1
2  0  1
3  1  0
4  1  0
5  1  0

See this answer for more.
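For completeness, a small sketch of the get_dummies route that produces a single indicator column and attaches it to the dataframe. The names dummies and df_dummies are hypothetical. Note that recent pandas versions (2.0 and later) return boolean columns by default, so dtype=int is passed here to get 0/1 values.

```python
import pandas as pd

df = pd.DataFrame({"sex": list("mmmfff")})

# drop_first=True drops the first category ("f"), leaving one indicator column.
# dtype=int forces 0/1 integers regardless of the pandas version's default.
dummies = pd.get_dummies(df["sex"], prefix="sex", drop_first=True, dtype=int)

# join aligns on the index, so the new column lines up with the original rows.
df_dummies = df.join(dummies)
```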

FlorianGD
  • It raises `ValueError: Wrong number of items passed 2, placement implies 1` – melololo Sep 17 '21 at 06:44
  • Ah, yes, I replied too fast. You cannot assign the result to a single column, as the `OneHotEncoder` will create several columns. I'll edit the answer – FlorianGD Sep 17 '21 at 06:56