3

Background

I have the following dataframe

import pandas as pd
df = pd.DataFrame({'Result' : [['pos', '+', 'pos', 'positive'], ['neg', 'neg'], [], ['pos']], 
                  'P_ID': [1,2,3,4], 
                  'Gene' : [['kras', 'kras', 'kras', 'egfr'], ['brca', 'brca'], [], ['cd133']],
                  'N_ID' : ['A1', 'A2', 'A3', 'A4']

                 })

#rearrange columns
df = df[['P_ID', 'N_ID', 'Gene', 'Result']]
df


  P_ID N_ID         Gene                         Result
0   1   A1  [kras, kras, kras, egfr]    [pos, +, pos, positive]
1   2   A2  [brca, brca]                [neg, neg]
2   3   A3  []                          []
3   4   A4  [cd133]                     [pos]

And I use the following code taken from unnest (explode) multiple list 2.0

df.set_index('P_ID').apply(lambda x: x.apply(pd.Series).stack()).ffill().reset_index().drop('level_1', 1)

Problem

Using the code above, I get close to what I want. However, because the Gene and Result columns in the third row 2 are empty lists [], I get the output of the row above it as seen below:

 P_ID   N_ID Gene   Result
0   1   A1  kras    pos
1   1   A1  kras    +
2   1   A1  kras    pos
3   1   A1  egfr    positive
4   2   A2  brca    neg
5   2   A2  brca    neg
6   3   A3  brca    neg
7   4   A4  cd133   pos

Instead, I would like to get the output below, where the following row 6 3 A3 [] [] reflects the original dataframe df, which contained empty lists

Desired Output

   P_ID N_ID Gene   Result
0   1   A1  kras    pos
1   1   A1  kras    +
2   1   A1  kras    pos
3   1   A1  egfr    positive
4   2   A2  brca    neg
5   2   A2  brca    neg
6   3   A3  []      []
7   4   A4  cd133   pos

Question

How do I get the desired output?

cs95
  • 379,657
  • 97
  • 704
  • 746

2 Answers2

2

Let's try some stacking and unstacking magic. This preserves empty lists as well.

(df.set_index(['P_ID', 'N_ID'])
   .stack()
   .str.join(',')
   .str.split(',', expand=True)
   .stack()
   .unstack(-2)
   .reset_index(level=[0, 1])
   .reset_index(drop=True))

   P_ID N_ID    Result   Gene
0  1     A1   pos       kras 
1  1     A1   +         kras 
2  1     A1   pos       kras 
3  1     A1   positive  egfr 
4  2     A2   neg       brca 
5  2     A2   neg       brca 
6  3     A3                  
7  4     A4   pos       cd133

Details

First, set the columns that are not to be touched as the index.

df.set_index(['P_ID', 'N_ID'])

                            Result                      Gene
P_ID N_ID                                                   
1    A1    [pos, +, pos, positive]  [kras, kras, kras, egfr]
2    A2    [neg, neg]               [brca, brca]            
3    A3    []                       []                      
4    A4    [pos]                    [cd133]                 

Next, stack the rows.

_.stack()

P_ID  N_ID        
1     A1    Result    [pos, +, pos, positive] 
            Gene      [kras, kras, kras, egfr]
2     A2    Result    [neg, neg]              
            Gene      [brca, brca]            
3     A3    Result    []                      
            Gene      []                      
4     A4    Result    [pos]                   
            Gene      [cd133]                 
dtype: object

We have a series now. We need to explode these elements into separate columns. So, first join the lists, then split again. This works assuming your lists elements don't contain commas themselves (if not, find another separator to join and split on).

_.str.join(',').str.split(',', expand=True)

                      0     1     2         3
P_ID N_ID                                    
1    A1   Result  pos    +     pos   positive
          Gene    kras   kras  kras  egfr    
2    A2   Result  neg    neg   None  None    
          Gene    brca   brca  None  None    
3    A3   Result         None  None  None    
          Gene           None  None  None    
4    A4   Result  pos    None  None  None    
          Gene    cd133  None  None  None    

We need to get rid of NULL values, so call stack again.

_.stack()

P_ID  N_ID           
1     A1    Result  0    pos     
                    1    +       
                    2    pos     
                    3    positive
            Gene    0    kras    
                    1    kras    
                    2    kras    
                    3    egfr    
2     A2    Result  0    neg     
                    1    neg     
            Gene    0    brca    
                    1    brca    
3     A3    Result  0            
            Gene    0            
4     A4    Result  0    pos     
            Gene    0    cd133   
dtype: object

We're almost there. Now we want the second last level of the index to become our columns, so unstack using unstack(-2) (unstack on the second last level)

_.unstack(-2)

               Result   Gene
P_ID N_ID                   
1    A1   0  pos       kras 
          1  +         kras 
          2  pos       kras 
          3  positive  egfr 
2    A2   0  neg       brca 
          1  neg       brca 
3    A3   0                 
4    A4   0  pos       cd133

Lastly, some housekeeping to obtain our original columns.

_.reset_index(-1, drop=True).reset_index()

   P_ID N_ID    Result   Gene
0  1     A1   pos       kras 
1  1     A1   +         kras 
2  1     A1   pos       kras 
3  1     A1   positive  egfr 
4  2     A2   neg       brca 
5  2     A2   neg       brca 
6  3     A3                  
7  4     A4   pos       cd133

If you want the blanks to actually be lists, use applymap:

_.applymap(lambda x: x if x != '' else []))

   P_ID N_ID    Result   Gene
0  1     A1   pos       kras 
1  1     A1   +         kras 
2  1     A1   pos       kras 
3  1     A1   positive  egfr 
4  2     A2   neg       brca 
5  2     A2   neg       brca 
6  3     A3   []        []   
7  4     A4   pos       cd133
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Thanks for this, extremely informative, with pandas unstack, does it unstack series or indexes? I always get confused with that – Umar.H Jun 03 '19 at 02:53
  • @Datanovice It unstacks a Series along an index or set of indices. – cs95 Jun 03 '19 at 03:12
2

After adjust unnesting still work for it

df=df.applymap(lambda x : [''] if x==[] else x)
unnesting(df,['Gene','Result'])
Out[20]: 
    Gene    Result N_ID  P_ID
0   kras       pos   A1     1
0   kras         +   A1     1
0   kras       pos   A1     1
0   egfr  positive   A1     1
1   brca       neg   A2     2
1   brca       neg   A2     2
2                    A3     3
3  cd133       pos   A4     4
BENY
  • 317,841
  • 20
  • 164
  • 234