1

I have a dataframe with one hot encoded categorical labels that looks like this:

id      label1: val1    label1: val2   label2: val1     label2: val2 
12         0                  1             0                1
13         1                  0             0                1
14         0                  0             1                0  

I want to transform this into something like this:

id      label1       label2
12       val2         val2
13       val1         val2 
14        Na          val1 

I am totally confused about how this can be achieved. Can anyone please guide me? The naive process I can think of is iterating over all columns and assign values accordingly:

for i in issues.columns[30:]:
    for x in issues[str(i)]:
        if x==0: continue
        issues[i.split(':')[0]][x]=i.split(':')[1]   

But this does not seem to work (maybe I am getting the logic wrong?)

edit ----------------

I need to replicate the solution for all similar hot-encoded columns in the same df.

df.columns[30:]

['Aspect: Documentation', 'Aspect: Packaging', 'Aspect: Performance',
       'Aspect: Security', 'Aspect: Testing', 'Aspect: UX', 'Backport: 14',
       'Backport: 15', 'Design Proposal: Accepted. PRs Welcome', 'Epic',
       'Expeditor: Bump Version Minor', 'Expeditor: Skip All',
       'Expeditor: Skip Changelog', 'Expeditor: Skip Habitat',
       'Expeditor: Skip Omnibus', 'Expeditor: Skip Version Bump',
       'Focus: Chef Server', 'Focus: Config', 'Focus: Data collector',
       'Focus: SCM Resources', 'Focus: knife', 'Focus: knife bootstrap',
       'Focus: knife ssh', 'Platform: Debian-like', 'Platform: Linux',
       'Platform: RHEL-like', 'Platform: Unix-like', 'Platform: Windows',
       'Platform: macOS', 'Priority: Critical', 'Priority: Low',
       'Priority: Medium', 'Status: Adopted', 'Status: Good First Issue',
       'Status: Help Wanted', 'Status: Incomplete', 'Status: Needs RFC',
       'Status: Needs Replication', 'Status: Sustaining Backlog',
       'Status: Untriaged', 'Status: Waiting on Contributor',
       'Triage: Confirmed', 'Triage: Declined', 'Triage: Duplicate',
       'Triage: Feature Request', 'Triage: Needs Information',
       'Triage: Support', 'Type: Breaking Change', 'Type: Bug',
       'Type: Build Breakage', 'Type: Chore', 'Type: Deprecation',
       'Type: Design Proposal', 'Type: Enhancement', 'Type: Regression',
       'Type: Tech Debt', 'Aspect: Integration', 'Aspect: Stability',
       'Focus: knife search']

If one row has multiple labels associated with it (say Type:Tech Debt and Type: Enhancement), we can add the labels one after in the corresponding row (ie Tech Debt,Enhancement)

Devarshi Goswami
  • 1,035
  • 4
  • 11
  • 26

1 Answers1

2

Use DataFrame.melt with filter 1, split values and pivoting by DataFrame.pivot_table:

print (df)
   id  label1: val1  label1: val2  label2: val1  label2: val2
0  12             1             1             0             1
1  13             1             0             0             1
2  14             0             0             1             0

#if necessary omit first 30 columns without id
#df = df[['id'] + df.columns[30:].tolist()]
df1 = df.melt('id').query('value == 1')
df1[['label','val']] = df1.variable.str.split(': ', expand=True)
df1 = df1.pivot_table(index='id',columns='label',values='val', aggfunc=','.join)
print (df1)
label     label1 label2
id                     
12     val1,val2   val2
13          val1   val2
14           NaN   val1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks a lot for the solution, but for some reason, the solution only runs for a subset of the actual number of columns. check image for reference https://ibb.co/s1T8X8Z – Devarshi Goswami Nov 19 '20 at 13:36
  • @DevarshiGoswami - There are 286, only not displayed, need [this](https://stackoverflow.com/a/30691921) for see all rows. – jezrael Nov 19 '20 at 13:39