Pandas, reverse one hot encoding

Question

I one-hot encoded a variable, and after some computation I would like to recover the original one.

What I am doing is the following:

I filter the one-hot encoded column names (they all start with the name of the original variable, let's say 'mycol'):

filter_col = [col for col in df if col.startswith('mycol')]

Then I can simply multiply the column names by the filtered variables.

X_test[filter_col]*filter_col

However, this yields a sparse matrix. How do I create a single variable out of it? Summing doesn't work, since the empty cells seem to be treated as numbers, and running sum(X_test[filter_col]*filter_col) gives:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Any suggestion on how to proceed? Is this even the best approach or is there some function out there doing exactly what I need?

As requested, here is an example, taken from here:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mycol': np.random.choice(['panda', 'python', 'shark'], 10),
})

df = pd.get_dummies(df)

Do you need `(X_test[filter_col]*filter_col).sum()` or `(X_test[filter_col]*filter_col).sum(axis=1)`? — jezrael, Jun 20 '19 at 08:31
Also if get all columns starting by string `mycol`, then also failed `X_test[filter_col]*filter_col` — jezrael, Jun 20 '19 at 08:34
Can you create some [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve)? — jezrael, Jun 20 '19 at 08:35
`X_test[filter_col].idxmax(1).str.replace('mycol_', '')` ..? — Chris Adams, Jun 20 '19 at 08:39
@jezrael Your first comment is the solution: `(X_test[filter_col]*filter_col).sum(axis=1)`. Would you post it as answer so I can accept it? — CAPSLOCK, Jun 20 '19 at 08:40
@ChrisA Thanks Chris =) You went a step forward and also cleaned up the result — CAPSLOCK, Jun 20 '19 at 08:42
ok, so never only `0`, always only one `1` per rows? If yes, then accepted solution working, else not. — jezrael, Jun 20 '19 at 08:56

jezrael · Answer 1 · 2019-06-20T08:57:06.853

If you need to sum the values per row:

(X_test[filter_col]*filter_col).sum(axis=1)
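
A likely reason for the TypeError in the question: Python's built-in sum iterates over a DataFrame's column labels rather than its values, so it ends up adding the integer start value 0 to a string label. A small sketch with made-up data (the frame and its values are illustrative, not from the question):

```python
import pandas as pd

# Illustrative frame with string column labels (hypothetical data).
frame = pd.DataFrame({'mycol_a': [1, 0], 'mycol_b': [0, 1]})

# Iterating over a DataFrame yields its column labels.
print(list(frame))  # ['mycol_a', 'mycol_b']

# The built-in sum therefore tries 0 + 'mycol_a' and fails.
try:
    sum(frame)
except TypeError as exc:
    message = str(exc)
    print(message)

# The DataFrame method sums the values instead, here row by row.
row_totals = frame.sum(axis=1)
print(row_totals.tolist())  # [1, 1]
```

This is why the method call with axis=1 behaves differently from wrapping the same expression in the built-in sum.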

A solution that also handles rows containing only 0s or multiple 1s:

X_test = pd.DataFrame({
    'mycolB': [0, 1, 1, 0],
    'mycolC': [0, 0, 1, 0],
    'mycolD': [1, 0, 0, 0],
})

filter_col = [col for col in X_test if col.startswith('mycol')]
df = X_test[filter_col].dot(pd.Index(filter_col) + ', ').str.strip(', ')
print(df)
0            mycolD
1            mycolB
2    mycolB, mycolC
3                  
dtype: object
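
For comparison, here is a sketch of the same result built with plain Python string joins instead of the dot product, in case multiplying numbers by strings feels too implicit. It reuses the data from the example above; the names `mask` and `labels` are introduced here only for illustration.

```python
import pandas as pd

X_test = pd.DataFrame({
    'mycolB': [0, 1, 1, 0],
    'mycolC': [0, 0, 1, 0],
    'mycolD': [1, 0, 0, 0],
})

filter_col = [col for col in X_test.columns if col.startswith('mycol')]

# Boolean view of the dummies: True where a category is set.
mask = X_test[filter_col].astype(bool).to_numpy()

# For each row, join the names of all columns that are set.
# Rows without any 1 give an empty string.
labels = pd.Series(
    [', '.join(col for col, flag in zip(filter_col, row) if flag) for row in mask],
    index=X_test.index,
)

print(labels.tolist())  # ['mycolD', 'mycolB', 'mycolB, mycolC', '']
```

The explicit loop is slower on large frames than a vectorized approach, but it makes the handling of empty and multi-label rows easy to see.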

score 1 · Accepted Answer · answered Jun 20 '19 at 08:44

IIUC, you can use DataFrame.idxmax along axis=1. If necessary, you can strip the dummy prefix with str.replace:

X_test[filter_col].idxmax(axis=1).str.replace('mycol_', '')
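
A minimal round-trip sketch of this idea, assuming exactly one 1 per row. The data and the `prefix` variable are made up for illustration; `dtype=int` is passed to `get_dummies` only to keep the dummies as 0/1 integers.

```python
import pandas as pd

# Illustrative data (not from the question): one categorical column.
df = pd.DataFrame({'mycol': ['panda', 'python', 'shark', 'python', 'panda']})

# One-hot encode; the columns become mycol_panda, mycol_python, mycol_shark.
X_test = pd.get_dummies(df, dtype=int)

prefix = 'mycol_'
filter_col = [col for col in X_test.columns if col.startswith(prefix)]

# idxmax(axis=1) returns, for each row, the label of the column holding the maximum (the 1).
recovered = X_test[filter_col].idxmax(axis=1).str.replace(prefix, '', regex=False)

print(recovered.tolist())  # ['panda', 'python', 'shark', 'python', 'panda']
```

Note that idxmax returns the first column for a row of all zeros, so this only round-trips when every row has exactly one 1. Newer pandas releases (1.5 and later, as far as I know) also provide `pd.from_dummies`, which may be worth checking for this task.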

Chris Adams
