Display feature names in columns after using One Hot encoding

Question

I have a CSV column containing fruit names that I want to convert into an array.

Sample csv column:

Names:
Apple
Banana
Pear
Watermelon
Jackfruit
..
..
..

There are around 400 fruit names in the column.

I have applied one-hot encoding to it, but I am unable to display the column names (each fruit name from a row of the CSV column).

My code so far:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv('D:/fruits.csv')
X= dataset.iloc[:, 0].values


labelencoder_X = LabelEncoder()
D= labelencoder_X.fit_transform(X)
D = D.reshape(-1, 1)

onehotencoder = OneHotEncoder(sparse=False, categorical_features = [0])
X = onehotencoder.fit_transform(D)

This converts the column data into a NumPy array, but the column names come out as [0 1 2 3 .. ..]. I want them to be the row values from the CSV, for example [Apple Banana Pear Watermelon .. .. ].

How can I retain the column names after one-hot encoding?

can you add your current output & desired output in question? — Furqan Hashim, Jul 04 '20 at 16:27
`.values` changes dataframe to numpy array which doesn't support string column names. You can try `X = pd.DataFrame(X, columns = dataset.columns)` — Sachin Prabhu, Jul 04 '20 at 16:44
@SachinPrabhu I am getting the error "ValueError: Shape of passed values is (1, 68197), indices imply (3, 68197)" — Lalit, Jul 04 '20 at 17:16
Does this answer your question? [Feature names from OneHotEncoder](https://stackoverflow.com/questions/54570947/feature-names-from-onehotencoder) — Ben Reiniger, Jul 04 '20 at 18:14

Furqan Hashim · Accepted Answer · 2020-07-04T21:28:16.677

2

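Original Answer:

A convenient way to one-hot encode while keeping readable column names is pd.get_dummies. Applied to sample data:

import pandas as pd

# Sample data standing in for the CSV column
data = {'Names': ['Apple', 'Banana', 'Pear', 'Watermelon']}
df = pd.DataFrame(data=data)

# One-hot encode; each new column is named <column>_<value>
df_new = pd.get_dummies(df)
print(df_new)

Original df: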

        Names
0       Apple
1      Banana
2        Pear
3  Watermelon

Encoded df:

   Names_Apple  Names_Banana  Names_Pear  Names_Watermelon
0            1             0           0                 0
1            0             1           0                 0
2            0             0           1                 0
3            0             0           0                 1
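
For the CSV case in the question, a minimal sketch along the same lines follows. It assumes the fruit names sit in the first column; the in-memory CSV text (read through io.StringIO) is a hypothetical stand-in for D:/fruits.csv, and its extra columns are made up for illustration. Calling pd.get_dummies on a single column (a Series) produces columns named after the values themselves, with no prefix. Recent pandas versions return boolean columns by default, so dtype=int is passed here to get 0/1 values.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'D:/fruits.csv'; the real file has ~400 rows.
csv_text = (
    "Names,Category,Val1\n"
    "Apple,A,10\n"
    "Banana,B,20\n"
    "Pear,A,30\n"
    "Watermelon,B,30\n"
)
dataset = pd.read_csv(io.StringIO(csv_text))

# Select the first column as a Series (assumption: names are in column 0).
names = dataset.iloc[:, 0]

# Encoding a Series yields one column per distinct value, named by that value.
encoded = pd.get_dummies(names, dtype=int)

# The array and its matching labels can be kept side by side.
X = encoded.to_numpy()
feature_names = encoded.columns.to_numpy()

print(list(feature_names))
print(X)
```

Keeping the result as a DataFrame (encoded) preserves the labels; converting to a NumPy array drops them, which is why feature_names is stored separately.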

Edit:

Suppose the dataframe contains two categorical and two numeric features, and we want to one-hot encode only one of the two categorical columns.

Generating dummy Data:

data = {'Names':['Apple','Banana','Pear', 'Watermelon'],
        'Category' :['A','B','A','B'],
        'Val1':[10,20,30,30],
        'Val2':[60,70,80,90]}
df = pd.DataFrame(data=data)

        Names Category  Val1  Val2
0       Apple        A    10    60
1      Banana        B    20    70
2        Pear        A    30    80
3  Watermelon        B    30    90

To one-hot encode only Names, pass it in the columns argument:

df_new = pd.get_dummies(df, columns=['Names'])
print(df_new)

See the pandas documentation for pd.get_dummies. Passing columns encodes only the columns of interest and leaves the others unchanged.

Encoded Output:

  Category  Val1  Val2  Names_Apple  Names_Banana  Names_Pear  Names_Watermelon
0        A    10    60            1             0           0                 0
1        B    20    70            0             1           0                 0
2        A    30    80            0             0           1                 0
3        B    30    90            0             0           0                 1
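
If scikit-learn's OneHotEncoder is preferred, as in the question, the fitted encoder also exposes the category labels, so the names can be reattached afterwards. The following is a sketch under assumptions rather than the method from the question: the four-row sample data is made up, no LabelEncoder step is used because OneHotEncoder accepts string columns directly, and the default sparse result is converted with .toarray() so the code does not depend on the name of the sparse-output parameter, which has changed across scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data standing in for the CSV column.
df = pd.DataFrame({'Names': ['Apple', 'Banana', 'Pear', 'Watermelon']})

# Fit on a 2-D selection (double brackets keep it a one-column DataFrame).
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['Names']]).toarray()

# categories_ holds one array of labels per encoded input column.
fruit_names = encoder.categories_[0]

# Reattach the labels as column names.
df_encoded = pd.DataFrame(encoded.astype(int), columns=fruit_names, index=df.index)
print(df_encoded)
```

Recent scikit-learn releases also provide encoder.get_feature_names_out(), which returns prefixed names such as Names_Apple when the encoder is fitted on a DataFrame.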

edited Jul 04 '20 at 21:28

answered Jul 04 '20 at 16:43

Furqan Hashim


Hi Furqan. I cannot create the data variable manually like that as there are around 400 items under the Names column. Any suggestions on how to tackle that? – Lalit Jul 04 '20 at 17:12
If 400 items are in a column of pandas dataframe the above code should work. Have you tried the code in solution? – Furqan Hashim Jul 04 '20 at 17:16
My doubt is `data = {'Names':['Apple','Banana','Pear', 'Watermelon']}` contains only 4 items but it should contains 400 fruit names from the csv column. – Lalit Jul 04 '20 at 17:19
I assume you are reading a csv which you’ve named as dataset. Replace df in 2nd last line of code with dataset. I’ve created data just to show an example. – Furqan Hashim Jul 04 '20 at 17:21
Let me try that. Also, I originally have 3 columns in the csv but I only want to convert 1 column into an array i.e the Names column. `data = pd.read_csv('D:/fruits.csv')`, `data = data[:, 0]`, `df = pd.DataFrame(data=data)` I am getting an error "TypeError: unhashable type: 'slice'" in this case – Lalit Jul 04 '20 at 17:29
Use `data = data.iloc[:, 0]`, then `new_data = pd.get_dummies(data)` – Furqan Hashim Jul 04 '20 at 17:32
Thanks. That worked. Any idea on if I want to display the corresponding column data of the next row in the initial csv in-place of the newly generated 'Index' column – Lalit Jul 04 '20 at 17:54
Yes, if you can provide the csv or sample data then I can give you a much better solution. – Furqan Hashim Jul 04 '20 at 17:57
Thanks mate. Let me put things in order in a new question and I will share the link with you here so that you can help me out maybe – Lalit Jul 04 '20 at 18:15
Check the edit, I think it explains what you are trying to achieve. Please upvote if you got the desired results. – Furqan Hashim Jul 04 '20 at 21:29
