
The column names are ID, 1, 2, 3, 4, 5, 6, 7, 8, 9, and the column values are either 0 or 1.

My DataFrame looks like this:

 ID     1    2    3    4    5    6   7   8   9 

1002    0    1    0    1    0    0   0   0   0
1003    0    0    0    0    0    0   0   0   0 
1004    1    1    0    0    0    0   0   0   0
1005    0    0    0    0    1    0   0   0   0
1006    0    0    0    0    0    1   0   0   0
1007    1    0    1    0    0    0   0   0   0
1000    0    0    0    0    0    0   0   0   0
1009    0    0    1    0    0    0   1   0   0

For each row, I want one output row per column whose value is 1, pairing the ID with that column's name.

The DataFrame I want should look like this:

 ID      Col2
1002       2    // has 1 at Col(2) and Col(4)
1002       4    
1004       1    // has 1 at col(1) and col(2)
1004       2
1005       5    // has 1 at col(5)
1006       6    // has 1 at col(6)
1007       1    // has 1 at col(1) and col(3)
1007       3
1009       3    // has 1 at col(3) and col(7)
1009       7

Please help me with this. Thanks in advance!

MukundS

8 Answers


Pretty one-liner :)

new_df = df.idxmax(axis=1)
Zeel B Patel
  • NOTE: This doesn't work (and fails silently!) if there are any rows with all zeroes! It will just return the first index, since 0 is the max. – zmbc Aug 24 '23 at 16:13
  • A more robust solution, taking inspiration from BENY's, is: `df[df == 1].idxmax(axis=1)`. That will give a NaN value when a row is all zeroes. – zmbc Aug 24 '23 at 16:15
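To see the pitfall from the comments side by side with a fix, here is a small sketch on a slice of the OP's data (the three-row frame is mine, and dropping all-zero rows first is one variant of the workaround, not taken verbatim from the comments). Note also that idxmax returns only the *first* matching column per row, so it cannot reproduce the multi-row output the question asks for:

```python
import pandas as pd

# A slice of the OP's data: 1003 is all zeroes, 1004 has two 1s
df = pd.DataFrame({
    "ID": [1002, 1003, 1004],
    "1": [0, 0, 1],
    "2": [1, 0, 1],
    "3": [0, 0, 0],
})
d = df.set_index("ID")

# Naive idxmax: the all-zero row 1003 silently comes back as column "1"
naive = d.idxmax(axis=1)

# One way around it: drop all-zero rows before calling idxmax
robust = d[d.sum(axis=1) > 0].idxmax(axis=1)
print(robust)
```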

There are several great answers to the OP's post. However, get_dummies is often used for multiple categorical features, and pandas uses a prefix separator (prefix_sep) to distinguish the different values of a column.

The following function collapses a "dummified" dataframe while keeping the order of columns:

import pandas as pd

def undummify(df, prefix_sep="_"):
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            undummified = (
                df.filter(like=col)
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df

Example

>>> df
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
>>> df2 = pd.get_dummies(df)
>>> df2
   a_A_1  a_A_2  b_B_1  b_B_2  c_C_1  c_C_2
0      1      0      1      0      1      0
1      0      1      0      1      0      1
>>> df3 = undummify(df2)
>>> df3
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
Mahomet

set_index + stack: masking with df[df == 1] turns the zeros into NaN, which stack then drops by default.

df.set_index('ID', inplace=True)

df[df==1].stack().reset_index().drop(0, axis=1)
Out[363]: 
     ID level_1
0  1002       2
1  1002       4
2  1004       1
3  1004       2
4  1005       5
5  1006       6
6  1007       1
7  1007       3
8  1009       3
9  1009       7
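To match the Col2 column name the question asks for, a rename of the auto-generated level_1 label can be appended. A sketch on a two-row frame of my own (the explicit .dropna() is my addition, in case a newer pandas stack no longer drops NaN by default):

```python
import pandas as pd

# My own two-row frame, mirroring the OP's layout
df = pd.DataFrame({
    "ID": [1002, 1004],
    "1": [0, 1],
    "2": [1, 1],
})

out = (
    df.set_index("ID")
      .pipe(lambda d: d[d == 1])            # zeros become NaN
      .stack()                              # long form: one row per (ID, column)
      .dropna()                             # explicit, for newer pandas versions
      .reset_index()
      .drop(0, axis=1)                      # drop the column of 1.0 values
      .rename(columns={"level_1": "Col2"})  # name the column as the OP wants
)
print(out)
```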
BENY

np.argwhere

import numpy as np
import pandas as pd

v = np.argwhere(df.drop('ID', axis=1).values).T
pd.DataFrame({'ID' : df.loc[v[0], 'ID'], 'Col2' : df.columns[1:][v[1]]})

  Col2    ID
0    2  1002
0    4  1002
2    1  1004
2    2  1004
3    5  1005
4    6  1006
5    1  1007
5    3  1007
7    3  1009
7    7  1009

argwhere gets the i, j indices of all non-zero elements in your DataFrame. Use the first column of indices to index into column ID, and the second column of indices to index into df.columns.

I transpose v before step 2 for cache efficiency and less typing.

cs95

Use:

df = (df.melt('ID', var_name='Col2')
        .query('value == 1')
        .sort_values(['ID', 'Col2'])
        .drop('value', axis=1))

Alternative solution:

df = (df.set_index('ID')
        .mask(lambda x: x == 0)
        .stack()
        .reset_index()
        .drop(0, axis=1))

print (df)
      ID Col2
8   1002    2
24  1002    4
2   1004    1
10  1004    2
35  1005    5
44  1006    6
5   1007    1
21  1007    3
23  1009    3
55  1009    7

Explanation:

  1. First, reshape the values with melt, or with set_index and stack

  2. Keep only the 1s with query, or convert the 0s to NaN with mask

  3. sort_values for the first solution

  4. Create columns from the MultiIndex with reset_index

  5. Finally, remove the unnecessary column with drop
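The melt pipeline above can be run end to end on a small frame of my own (step 4 applies only to the mask variant, so it is skipped here):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1002, 1004],
    "1": [0, 1],
    "2": [1, 1],
})

out = (
    df.melt("ID", var_name="Col2")    # 1. reshape to long form
      .query("value == 1")            # 2. keep only the 1s
      .sort_values(["ID", "Col2"])    # 3. restore the row order
      .drop("value", axis=1)          # 5. drop the helper column
      .reset_index(drop=True)
)
print(out)
```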

jezrael

As of pandas 1.5.0, the following will do the trick:

dummy_cols = ['col1', 'col2', 'col3']  # your dummy column names
pd.from_dummies(df[dummy_cols])
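A runnable sketch (requires pandas >= 1.5; the color_* column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color_red": [1, 0, 0],
    "color_green": [0, 1, 0],
    "color_blue": [0, 0, 1],
})

# With sep="_", from_dummies strips the "color" prefix back off
# and rebuilds a single categorical column
original = pd.from_dummies(df, sep="_")
print(original)
```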
KingOtto

New in pandas 1.5.0, there is a built-in, from_dummies(), that inverts the operation performed by get_dummies(). Since get_dummies() usually prefixes each dummy column with the original label, use the sep= parameter to get the original values back.

df_w_dummies.head()
>>>
   | pitch_type_FF | pitch_type_CU | pitch_type_CH
--------------------------------------------------
  1|             0 |             0 |             1
  2|             1 |             0 |             0
  3|             1 |             0 |             0
  4|             0 |             1 |             0
  5|             0 |             0 |             1

# .from_dummies() returns a data frame
df_reversed = pd.from_dummies(df_w_dummies, sep='pitch_type_').rename(columns={'': 'pitch_type'})
df_reversed.head()
>>>
   | pitch_type 
---------------
  1|         CH 
  2|         FF
  3|         FF 
  4|         CU 
  5|         CH 
trpubz

@Mahomet has the most correct answer, so, inspired by that, here's a version that uses the new from_dummies functionality in pandas:

import pandas as pd

# Just some quick fake data with multiple categorical columns
data = {"name": ["jill", "bob", "sue", "sally"],
        "color": ["blue", "red", "blue", "green"],
        "food": ["pizza", "tacos", "tacos", "pizza"]}
df = pd.DataFrame(data)
print(df)

# Run get dummies on the categorical columns
dum_df = pd.get_dummies(df, columns=["color", "food"])
print(dum_df)

# Now a hack to find the dummy columns (they contain underscores),
# un-dummy them, and stitch everything back together
under_cols = [col for col in dum_df.columns if '_' in col]
un_dum_df = pd.concat([dum_df.drop(under_cols, axis=1),
                       pd.from_dummies(dum_df[under_cols], sep="_")], axis=1)
print(un_dum_df)

The output looks like this

❯ python dummies.py
    name  color   food
0   jill   blue  pizza
1    bob    red  tacos
2    sue   blue  tacos
3  sally  green  pizza

    name  color_blue  color_green  color_red  food_pizza  food_tacos
0   jill        True        False      False        True       False
1    bob       False        False       True       False        True
2    sue        True        False      False       False        True
3  sally       False         True      False        True       False

    name  color   food
0   jill   blue  pizza
1    bob    red  tacos
2    sue   blue  tacos
3  sally  green  pizza

NB: If you have column names with underscores that aren't from get_dummies, this will probably fail in some interesting way.

Brian Wylie