Reversing 'one-hot' encoding in Pandas

Question

I want to go from this data frame, which is basically one-hot encoded:

    In [2]: pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})

    Out[2]:
       fox  monkey  rabbit
    0    0       0       1
    1    0       1       0
    2    1       0       0
    3    0       0       0
    4    0       0       0

To this one which is 'reverse' one-hot encoded.

    In [3]: pd.DataFrame({"animal":["monkey","rabbit","fox"]})
    Out[3]:
       animal
    0  monkey
    1  rabbit
    2     fox

I imagine there's some sort of clever use of apply or zip to do this, but I'm not sure how. Can anyone help?

I've not had much success using indexing etc. to try to solve this problem.

@PeadarCoyle, could you post your desired DF for this input DF: `pd.DataFrame({'dog': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1}, 'fox': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}, 'monkey': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0, 5: 0}, 'rabbit': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}})`, because now i don't understand your desired DF? — MaxU - stand with Ukraine, Jul 12 '16 at 20:57
@PeadarCoyle, could you please clarify whether your input data set might have more than one `1` in one column? And how did you get rows containing only zeroes? — MaxU - stand with Ukraine, Jul 12 '16 at 21:15

score 85 · Answer 1 · edited Jan 15 '23 at 14:42

UPDATE: I think ayhan is right and it should be:

df.idxmax(axis=1)

For each row, this returns the label of the column holding the maximum value. Since the data are only 1s and 0s, that is the column containing the 1. A row with no 1 at all ties at 0, so idxmax returns the first column label for it.

Demo:

In [40]: s = pd.Series(['dog', 'cat', 'dog', 'bird', 'fox', 'dog'])

In [41]: s
Out[41]:
0     dog
1     cat
2     dog
3    bird
4     fox
5     dog
dtype: object

In [42]: pd.get_dummies(s)
Out[42]:
   bird  cat  dog  fox
0   0.0  0.0  1.0  0.0
1   0.0  1.0  0.0  0.0
2   0.0  0.0  1.0  0.0
3   1.0  0.0  0.0  0.0
4   0.0  0.0  0.0  1.0
5   0.0  0.0  1.0  0.0

In [43]: pd.get_dummies(s).idxmax(1)
Out[43]:
0     dog
1     cat
2     dog
3    bird
4     fox
5     dog
dtype: object
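
Rows that contain no 1 are worth handling explicitly, since idxmax still returns a label for them. The following is a minimal sketch, assuming such rows should decode to a missing value; the frame mirrors the five-row input from the question, and masking with `where` is one possible approach rather than part of the original answer.

```python
import pandas as pd

# Five-row frame from the question: rows 3 and 4 contain no 1.
df = pd.DataFrame({
    "monkey": [0, 1, 0, 0, 0],
    "rabbit": [1, 0, 0, 0, 0],
    "fox":    [0, 0, 1, 0, 0],
})

# idxmax alone labels all-zero rows with the first column ("monkey").
raw = df.idxmax(axis=1)

# Keep the label only where the row actually contains a 1; otherwise NaN.
has_label = df.eq(1).any(axis=1)
animal = raw.where(has_label)

print(animal)
```

If the all-zero rows stand for a category dropped with drop_first=True, `animal.fillna(...)` with that category's name would restore it, provided the name is known.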

OLD answer: (most probably, incorrect answer)

try this:

In [504]: df.idxmax().reset_index().rename(columns={'index':'animal', 0:'idx'})
Out[504]:
   animal  idx
0     fox    2
1  monkey    1
2  rabbit    0

data:

In [505]: df
Out[505]:
   fox  monkey  rabbit
0    0       0       1
1    0       1       0
2    1       0       0
3    0       0       0
4    0       0       0

edited Jan 15 '23 at 14:42

Mustafa Aydın

answered Jul 12 '16 at 16:33

MaxU - stand with Ukraine

What happens if any of the columns repeat, say two monkeys at [1, 3]? Would this pick it up? – Merlin Jul 12 '16 at 18:31
Shouldn't it be `df.idxmax(axis=1)`? – ayhan Jul 12 '16 at 20:29
@ayhan, it looks much better, but, unfortunately, it doesn't always work properly! – MaxU - stand with Ukraine Jul 12 '16 at 20:31
@ayhan, try it against this DF: `pd.DataFrame({'dog': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1}, 'fox': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}, 'monkey': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0, 5: 0}, 'rabbit': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}})` – MaxU - stand with Ukraine Jul 12 '16 at 20:37
If you don't pass axis=1 it will check the columns for 1s, but a column may have multiple 1s (a dataset might have more than one dog, but an animal cannot be a dog and a cat at the same time :)). Yes, your example is a possibility if the dummies were created using drop_first=True, but in that case we should know what the first category was. Currently, there is no such information. – ayhan Jul 12 '16 at 20:44
@ayhan, if I understand `one-hot encoding` correctly there might be only one `1` (one) per column and OP wants to know their indexes... – MaxU - stand with Ukraine Jul 12 '16 at 20:46
It should be one 1 per row actually. You can try it with `pd.Series(['dog', 'cat', 'dog', 'bird']).str.get_dummies()`. get_dummies will always produce a structure like this (never more than one 1 in a row). OP's question is problematic. They want the original array which was used to create dummies but the order in the example is wrong (it should be rabbit, monkey, fox). Other than that, like I said it is a common practice to drop one of the columns while creating dummies (to avoid multicollinearity) but in order to return back to the original array we have to know what that column was. – ayhan Jul 12 '16 at 20:58
Even in that case, I think the use of idxmax() is the best way to go. Maybe first filter by all zeros and assign it to the dropped column. But again, OP should clarify that first. – ayhan Jul 12 '16 at 20:58
@ayhan, my understanding of `one-hot encoding` was, most probably, wrong - thank you for the example... – MaxU - stand with Ukraine Jul 12 '16 at 21:02
@ayhan, i still don't get it `pd.DataFrame({'dog': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1}, 'fox': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}, 'monkey': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0, 5: 0}, 'rabbit': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}}).idxmax(1)` - returns IMO unexpected results... – MaxU - stand with Ukraine Jul 12 '16 at 21:12
Yes because it has all zeros in some rows. You should first handle them (OP should tell us which animal it is). I would probably do something like this: `df.sum(axis=1).map({0: 'That animal'}).fillna(df.idxmax(axis=1))` – ayhan Jul 12 '16 at 21:14
@Merlin, sorry, now i'm not sure that i understand `one-hot encoding` correctly... – MaxU - stand with Ukraine Jul 12 '16 at 21:20
@MaxU, no problem. – Merlin Jul 12 '16 at 21:40
Note that if you use drop_first=True in get_dummies, you will get wrong results. – Guy s Feb 24 '20 at 11:02

score 18 · Accepted Answer · answered Jul 12 '16 at 16:42

I would use apply to decode the columns:

In [2]: animals = pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})

In [3]: def get_animal(row):
   ...:     for c in animals.columns:
   ...:         if row[c]==1:
   ...:             return c

In [4]: animals.apply(get_animal, axis=1)
Out[4]: 
0    rabbit
1    monkey
2       fox
3      None
4      None
dtype: object

score 10 · Answer 3 · edited Oct 07 '18 at 15:07

This works with both single and multiple labels.

We can use advanced indexing to tackle this problem.

import pandas as pd

df = pd.DataFrame({"monkey":[1,1,0,1,0],"rabbit":[1,1,1,1,0],\
    "fox":[1,0,1,0,0], "cat":[0,0,0,0,1]})

df['tags']='' # to create an empty column

for col_name in df.columns:
    df.ix[df[col_name]==1,'tags']= df['tags']+' '+col_name

print(df)

And the result is:

   cat  fox  monkey  rabbit                tags
0    0    1       1       1   fox monkey rabbit
1    0    0       1       1       monkey rabbit
2    0    1       0       1          fox rabbit
3    0    0       1       1       monkey rabbit
4    1    0       0       0                 cat

Explanation: We iterate over the columns of the dataframe.

df.ix[selection criteria, columns to write value] = value
df.ix[df[col_name]==1,'tags']= df['tags']+' '+col_name

The line above selects the rows where df[col_name] == 1, picks the 'tags' column in those rows, and sets it to the right-hand side, which is df['tags']+' '+col_name. Each matching column therefore appends its own name to the row's tag string.

Note: .ix was deprecated in pandas 0.20 and removed in pandas 1.0. Use .loc instead here, since the selection is label-based.
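
As an illustration of that note, here is a sketch of the same loop written with .loc, assuming a recent pandas version. Two details are additions not in the original answer: the label columns are captured before 'tags' is created so the loop never visits 'tags' itself, and the leading space is stripped at the end. Recent pandas keeps dictionary insertion order, so the tags appear in the order monkey, rabbit, fox, cat rather than alphabetically.

```python
import pandas as pd

df = pd.DataFrame({
    "monkey": [1, 1, 0, 1, 0],
    "rabbit": [1, 1, 1, 1, 0],
    "fox":    [1, 0, 1, 0, 0],
    "cat":    [0, 0, 0, 0, 1],
})

# Capture the label columns first so the new 'tags' column is not iterated.
label_cols = list(df.columns)
df["tags"] = ""

for col_name in label_cols:
    hit = df[col_name] == 1
    # .loc performs the label-based selection that .ix did in the original.
    df.loc[hit, "tags"] = df.loc[hit, "tags"] + " " + col_name

# Remove the leading space left by the concatenation.
df["tags"] = df["tags"].str.strip()

print(df)
```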

score 4 · Answer 4 · piRSquared · edited Jul 12 '16 at 20:13

I'd do:

import numpy as np
import pandas as pd

cols = df.columns.to_series().values
pd.DataFrame(np.repeat(cols[None, :], len(df), 0)[df.astype(bool).values], df.index[df.any(axis=1)])

Timing

MaxU's method has the edge for large dataframes.

[Timing screenshots: small df, 5 x 3; large df, 1000000 x 52]

edited Jul 12 '16 at 20:13

answered Jul 12 '16 at 16:39

piRSquared

score 4 · Answer 5 · answered Oct 15 '22 at 16:44

As of pandas 1.5.0, reversing one-hot encoding is supported directly with pandas.from_dummies:

import pandas as pd  # v 1.5.0

onehot_df = pd.DataFrame({
    "monkey": [0, 1, 0],
    "rabbit": [1, 0, 0],
    "fox": [0, 0, 1]
})

new_df = pd.from_dummies(onehot_df)

#          
# 0  rabbit
# 1  monkey
# 2     fox

The resulting DataFrame appears to have no column header (the header is an empty string). To fix this, rename the column after from_dummies:

new_df = pd.from_dummies(onehot_df).rename(columns={'': 'animal'})

#    animal
# 0  rabbit
# 1  monkey
# 2     fox

Alternatively, if the column names already carry a prefix and a separator (as in the one-hot encoding produced by pandas.get_dummies on a DataFrame column), e.g.

import pandas as pd  # v 1.5.0

onehot_df = pd.DataFrame({
    'animal_fox': [0, 0, 1],
    'animal_monkey': [0, 1, 0],
    'animal_rabbit': [1, 0, 0]
})

#    animal_fox  animal_monkey  animal_rabbit
# 0           0              0              1
# 1           0              1              0
# 2           1              0              0

Simply specify the sep to reverse the encoding:

new_df = pd.from_dummies(onehot_df, sep='_')

#    animal
# 0  rabbit
# 1  monkey
# 2     fox

The string before the first instance of the sep delimiter will become the column header in the new DataFrame (in this case "animal") and the rest of the string will become the column values (in this case "rabbit", "monkey", "fox").
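
The question's frame also has rows with no 1 at all, and from_dummies raises a ValueError for those unless a default_category is supplied. A minimal sketch, assuming pandas 1.5.0 or later; the label "unknown" is a placeholder chosen for this example, not something the original data defines.

```python
import pandas as pd  # assumes pandas >= 1.5.0

onehot_df = pd.DataFrame({
    "monkey": [0, 1, 0, 0, 0],
    "rabbit": [1, 0, 0, 0, 0],
    "fox":    [0, 0, 1, 0, 0],
})

# "unknown" is a placeholder label for rows that contain no 1.
decoded = pd.from_dummies(onehot_df, default_category="unknown")
decoded = decoded.rename(columns={"": "animal"})

print(decoded)
```

If the encoding was produced with drop_first=True, passing the dropped category's name as default_category would recover it instead.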

score 3 · Answer 6 · answered Sep 19 '19 at 15:45

You could try using melt(). This method also works when you have multiple OHE labels for a row.

# Your OHE dataframe 
df = pd.DataFrame({"monkey":[0,1,0],"rabbit":[1,0,0],"fox":[0,0,1]})

mel = df.melt(var_name='animal', value_name='value') # Melting; var_name is passed as a scalar

mel[mel.value == 1].reset_index(drop=True) # this gives you the result, ordered column by column
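
Because melt stacks the columns one after another, the filtered result follows column order rather than the original row order. The sketch below keeps the original index so the rows can be put back in order; it assumes pandas 1.1.0 or later for the ignore_index parameter and is a variation on the answer, not its original code.

```python
import pandas as pd  # ignore_index requires pandas >= 1.1.0

df = pd.DataFrame({
    "monkey": [0, 1, 0],
    "rabbit": [1, 0, 0],
    "fox":    [0, 0, 1],
})

# Keep the original row index while melting.
mel = df.melt(var_name="animal", value_name="value", ignore_index=False)

# Keep the hot entries, then restore the original row order.
animal = mel.loc[mel["value"] == 1, "animal"].sort_index()

print(animal)
```

With multiple labels per row, the same index value simply appears once per label.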

score 1 · Answer 7 · Merlin · edited Jul 12 '16 at 18:17

Try this:

import numpy as np
import pandas as pd

df = pd.DataFrame({"monkey":[0,1,0,1,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0], "cat":[0,0,0,0,1]})
df 

   cat  fox  monkey  rabbit
0    0    0       0       1
1    0    0       1       0
2    0    1       0       0
3    0    0       1       0
4    1    0       0       0

pd.DataFrame([x for x in np.where(df ==1, df.columns,'').flatten().tolist() if len(x) >0],columns= (["animal"]) )

   animal
0  rabbit
1  monkey
2     fox
3  monkey
4     cat

edited Jul 12 '16 at 18:17

answered Jul 12 '16 at 16:48

Merlin

I included in timing over larger dataframe. – piRSquared Jul 12 '16 at 20:14

score 0 · Answer 8 · answered Oct 14 '19 at 16:07

This can be achieved with a simple apply on the dataframe:

# function to get the name of the column holding a 1 in each row
# (raises IndexError if a row contains no 1)
def get_animal(row):
    return row.index[row == 1][0]

# prepare an animal column
df['animal'] = df.apply(get_animal, axis=1)

score 0 · Answer 9 · answered Oct 07 '22 at 20:54

A way to deal with multiple labels without a for loop. The result is a column holding the matching labels for each row. If you have the same number of labels in each row, you can add result_type='expand' to get several columns.

df.apply(lambda x: df.columns[x==1], axis=1)
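
For plain Python lists in each cell, one option is to build the labels from a boolean mask. This is a sketch under the assumption that a Series of lists is the desired shape; the frame is borrowed from the multi-label answer above, and the names `mask` and `labels` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "monkey": [1, 1, 0, 1, 0],
    "rabbit": [1, 1, 1, 1, 0],
    "fox":    [1, 0, 1, 0, 0],
    "cat":    [0, 0, 0, 0, 1],
})

# Boolean mask of hot cells, one row per record.
mask = df.to_numpy() == 1

# Index each row's mask into the column labels to collect that row's labels.
labels = pd.Series(
    [df.columns[row].tolist() for row in mask],
    index=df.index,
    name="labels",
)

print(labels)
```

A row with no 1 yields an empty list, so no special handling is needed for all-zero rows.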

answered Oct 07 '22 at 20:54

Denis Kazakov
