
I have a column in a CSV file that contains the names of fruits, and I want to convert it into an array.

Sample csv column:

Names:
Apple
Banana
Pear
Watermelon
Jackfruit
..
..
..

There are around 400 fruit names in the column.

I have used one-hot encoding on it, but I am unable to display the column names (each fruit name from a row of the CSV column).

My code so far is:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv('D:/fruits.csv')
X= dataset.iloc[:, 0].values


labelencoder_X = LabelEncoder()
D= labelencoder_X.fit_transform(X)
D = D.reshape(-1, 1)

onehotencoder = OneHotEncoder(sparse=False, categorical_features = [0])
X = onehotencoder.fit_transform(D)

This converts the column data into a NumPy array, but the column names come out as [0 1 2 3 .. ..], whereas I want them to be the row values from the CSV, for example [Apple Banana Pear Watermelon .. ..].

How can I retain the column names after one-hot encoding?

Lalit
  • can you add your current output & desired output in question? – Furqan Hashim Jul 04 '20 at 16:27
  • `.values` changes dataframe to numpy array which doesn't support string column names. You can try `X = pd.DataFrame(X, columns = dataset.columns)` – Sachin Prabhu Jul 04 '20 at 16:44
  • @SachinPrabhu I am getting the error "ValueError: Shape of passed values is (1, 68197), indices imply (3, 68197)" – Lalit Jul 04 '20 at 17:16
  • Does this answer your question? [Feature names from OneHotEncoder](https://stackoverflow.com/questions/54570947/feature-names-from-onehotencoder) – Ben Reiniger Jul 04 '20 at 18:14

1 Answer


Original Answer:

A fairly efficient way to one-hot encode is `pd.get_dummies`. I've applied it to sample data:

data = {'Names':['Apple','Banana','Pear', 'Watermelon']}
df = pd.DataFrame(data=data)

df_new = pd.get_dummies(df)
print(df_new) 

Original df:

        Names
0       Apple
1      Banana
2        Pear
3  Watermelon

Encoded df:

   Names_Apple  Names_Banana  Names_Pear  Names_Watermelon
0            1             0           0                 0
1            0             1           0                 0
2            0             0           1                 0
3            0             0           0                 1
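
The same call works on a column read from a file, so the 400 names never need to be typed out. The following is a minimal sketch under assumptions: the in-memory `csv_text` stands in for the real file, and the `Price` column is made up for illustration. Passing a single Series to `pd.get_dummies` names each output column after the fruit itself, without a prefix. Note that recent pandas versions return boolean columns by default, so `dtype=int` is passed here to get 0/1 values as in the output above.

```python
import io
import pandas as pd

# Stand-in for the CSV on disk; the contents and the Price column are assumptions.
csv_text = "Names,Price\nApple,3\nBanana,1\nPear,2\nWatermelon,5\n"
dataset = pd.read_csv(io.StringIO(csv_text))

# A single Series yields one column per distinct value, labelled with that value.
encoded = pd.get_dummies(dataset["Names"], dtype=int)

fruit_names = list(encoded.columns)  # column labels, in sorted order
one_hot = encoded.to_numpy()         # plain NumPy array, same column order

print(fruit_names)
print(one_hot)
```

Keeping `fruit_names` next to `one_hot` preserves the mapping from array column to fruit name even after the DataFrame is converted to a NumPy array.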

Edit:

Let's assume that our dataframe contains 2 categorical and 2 numeric features, and that we want to one-hot encode only 1 of the 2 categorical columns.

Generating dummy data:

data = {'Names':['Apple','Banana','Pear', 'Watermelom'],
        'Category' :['A','B','A','B'],
        'Val1':[10,20,30,30],
        'Val2':[60,70,80,90]}
df = pd.DataFrame(data=data)

        Names Category  Val1  Val2
0       Apple        A    10    60
1      Banana        B    20    70
2        Pear        A    30    80
3  Watermelom        B    30    90

If we want to one-hot encode only Names, we can do that with:

df_new = pd.get_dummies(df, columns=['Names'])
print(df_new)

You can refer to the `pd.get_dummies` documentation. By passing `columns`, we encode only the columns of interest.

Encoded Output:

  Category  Val1  Val2  Names_Apple  Names_Banana  Names_Pear  Names_Watermelom
0        A    10    60            1             0           0                 0
1        B    20    70            0             1           0                 0
2        A    30    80            0             0           1                 0
3        B    30    90            0             0           0                 1
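
If the scikit-learn encoder from the question is preferred, the labels can still be recovered from the fitted encoder. This is only a sketch under assumptions: it uses a small made-up dataframe, relies on the `categories_` attribute of a fitted `OneHotEncoder`, and skips `LabelEncoder`, which recent scikit-learn versions do not need for string input.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative data only; in practice this would be the dataframe read from the CSV.
df = pd.DataFrame({"Names": ["Apple", "Banana", "Pear", "Watermelon"]})

encoder = OneHotEncoder()
# The encoder expects 2-D input, hence the double brackets; the default
# result is a sparse matrix, so convert it to a dense array.
encoded = encoder.fit_transform(df[["Names"]]).toarray()

# categories_ holds one array of labels per encoded input column.
labels = list(encoder.categories_[0])
encoded_df = pd.DataFrame(encoded, columns=labels, index=df.index)

print(encoded_df)
```

The values come back as floats here, and the column order follows `categories_`, which is sorted.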
Furqan Hashim
  • Hi Furqan. I cannot create the data variable manually like that as there are around 400 items under the Names column. Any suggestions on how to tackle that? – Lalit Jul 04 '20 at 17:12
  • If 400 items are in a column of pandas dataframe the above code should work. Have you tried the code in solution? – Furqan Hashim Jul 04 '20 at 17:16
  • My doubt is that `data = {'Names':['Apple','Banana','Pear', 'Watermelon']}` contains only 4 items, but it should contain the 400 fruit names from the csv column. – Lalit Jul 04 '20 at 17:19
  • I assume you are reading a csv which you’ve named as dataset. Replace df in 2nd last line of code with dataset. I’ve created data just to show an example. – Furqan Hashim Jul 04 '20 at 17:21
  • Let me try that. Also, I originally have 3 columns in the csv but I only want to convert 1 column into an array i.e the Names column. `data= pd.read_csv('D:/fruits.csv') data = data[:, 0] df = pd.DataFrame(data=data)` I am getting an error "TypeError: unhashable type: 'slice'" in this case – Lalit Jul 04 '20 at 17:29
  • Use `data = data.iloc[:, 0]` and then `new_data = pd.get_dummies(data)` – Furqan Hashim Jul 04 '20 at 17:32
  • Thanks. That worked. Any idea what to do if I want to display the corresponding column data of the next row in the initial csv in place of the newly generated 'Index' column? – Lalit Jul 04 '20 at 17:54
  • Yes, if you can provide the csv or sample data then I can give you a much better solution. – Furqan Hashim Jul 04 '20 at 17:57
  • Thanks mate. Let me put things in order in a new question and I will share the link with you here so that you can help me out maybe – Lalit Jul 04 '20 at 18:15
  • Check the edit, I think it explains what you are trying to achieve. Please upvote if you got the desired results. – Furqan Hashim Jul 04 '20 at 21:29