
I have a Pandas DataFrame df that stores some numeric values:

print(df)

       value 
0          0
1          2
2          4
3          5
4          8

And I have a function that converts a numerical value to a one-hot vector:

print(to_categorical(0))
[1 0 0 0 0 0 0 0 0 0]

print(to_categorical(5))
[0 0 0 0 0 1 0 0 0 0]

etc...
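(For completeness, a minimal sketch of a function with this behaviour, assuming 10 classes like keras.utils.to_categorical; mine may differ in the details.)

import numpy as np

def to_categorical(values, num_classes=10):
    # Hypothetical sketch: one row per value, with a 1 at that value's index.
    values = np.atleast_1d(values).astype(int)
    one_hot = np.zeros((values.size, num_classes), dtype=int)
    one_hot[np.arange(values.size), values] = 1
    return one_hot.squeeze()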

So, I can call my function on my column of numeric values:

print(to_categorical(df['value']))

[[1 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0]]

And now I want to store my results as a new column. Here is what I expect for my example:

df['one-hot'] = to_categorical(df['value'])
print(df)

        value                    one-hot
0          0       [1 0 0 0 0 0 0 0 0 0]
1          2       [0 0 1 0 0 0 0 0 0 0]
2          4       [0 0 0 0 1 0 0 0 0 0]
3          5       [0 0 0 0 0 1 0 0 0 0]
4          8       [0 0 0 0 0 0 0 0 1 0]

But this gives me an error, since pandas tries to flatten my array into multiple columns. How can I do that?

Nakeuh
  • `df['one-hot'] = to_categorical(df['value']).tolist()` – Sreeram TP Mar 28 '19 at 09:53
  • Possible duplicate of [How do I get a DataFrame Index / Series column as an array or list?](https://stackoverflow.com/questions/17241004/how-do-i-get-a-dataframe-index-series-column-as-an-array-or-list) – Georgy Mar 28 '19 at 11:08

1 Answer


First, I think working with lists in pandas is not a good idea, but it is possible by converting to lists:

df['one-hot'] = to_categorical(df['value']).tolist()
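A minimal runnable sketch of this approach, assuming to_categorical is keras.utils.to_categorical (or something with the same behaviour):

import pandas as pd
from tensorflow.keras.utils import to_categorical  # assumption: Keras-style helper

df = pd.DataFrame({'value': [0, 2, 4, 5, 8]})

# to_categorical returns a 2D array (n_rows x n_classes); .tolist() turns it
# into a list of row-lists, so pandas stores one whole vector per cell instead
# of trying to spread the array over multiple columns.
df['one-hot'] = to_categorical(df['value'], num_classes=10).tolist()
print(df)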
jezrael
  • In my case, I want a structure that stores a mapping between my values (a few thousand unique values) and the corresponding one-hot vectors (so vectors of a few thousand values). What do you think would be a better approach? – Nakeuh Mar 28 '19 at 09:54
  • @Nakeuh - Better to create a new DataFrame - `df1 = pd.DataFrame(to_categorical(df['value']), index=df.index)` (see the sketch after these comments) – jezrael Mar 28 '19 at 09:54
  • Hm, I see. I suppose the 'all in one DataFrame' approach is less computationally efficient? I think I will stay with the non-efficient way for now (I prefer having only one object and I am not really concerned about efficiency in my use case), but I will keep your suggestion in mind for the future. Thanks! – Nakeuh Mar 28 '19 at 10:02
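A minimal sketch of the separate-DataFrame approach suggested in the comments, again assuming a Keras-style to_categorical; df1 gets one column per class and shares df's index so the rows stay aligned:

import pandas as pd
from tensorflow.keras.utils import to_categorical  # assumption: Keras-style helper

df = pd.DataFrame({'value': [0, 2, 4, 5, 8]})

# One integer column per class instead of one list per cell; reusing df.index
# keeps df1 aligned with df for later joins or lookups.
df1 = pd.DataFrame(to_categorical(df['value'], num_classes=10), index=df.index)
print(df1)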