Python melt dataframe based on values of comma-separated character vector column

Question

I'm currently working on a test where I have different regions with some associated statistic, and a comma-separated list of genes that lie in those regions. This list will be variable in number, and may not contain anything ("NA").

How can I "melt" this dataframe:

 region_id  statistic      genelist
          1        2.5       A, B, C
          2        0.5    B, C, D, E
          3        3.2          <NA>
          4        0.1          E, F

Into something like this:

     region_id  statistic gene
           1       2.5    A
           1       2.5    B
           1       2.5    C
           2       0.5    B
           2       0.5    C
           2       0.5    D
           2       0.5    E
           3       3.2 <NA>
           4       0.1    E
           4       0.1    F

U13-Forward · Answer 1 · 2019-03-06T06:29:16.260

7

Use the below code, use stack to stack it, after that split on ', ', then stack it again, since we stacked it twice, use unstack to unstack with -2, then reset the index using reset_index with -1, after that do the final reset_index with no parameters:

print(df.set_index(['region_id', 'statistic'])
   .stack()
   .str.split(', ', expand=True)
   .stack()
   .unstack(-2)
   .reset_index(-1, drop=True)
   .reset_index()
)

edited Mar 06 '19 at 06:29

answered Mar 06 '19 at 06:13

U13-Forward

69,221
14
89
114

1

It will great, if you provide explanation of your code approach. – shaik moeed Mar 06 '19 at 06:24
@shaikmoeed Edited mine – U13-Forward Mar 06 '19 at 06:29

meW · Accepted Answer · 2019-03-06T06:17:17.057

Use:

# Splitting on , and joining with region_id and statistic columns
val = pd.concat([df.region_id, 
                 df.statistic, 
                 df.genelist.str.split(',', expand=True)], 
                axis=1)

# Unpivoting and ignoring variable column
m = pd.melt(val, id_vars=['region_id', 'statistic'])\
            .loc[:, ['region_id', 'statistic', 'value']]

# Ignoring Null values and sorting based on region_id
m[m.value.notnull()]\
.sort_values('region_id')\
.reset_index(drop=True)\
.rename(columns={'value':'gene'})

 region_id  statistic gene
       1       2.5    A
       1       2.5    B
       1       2.5    C
       2       0.5    B
       2       0.5    C
       2       0.5    D
       2       0.5    E
       3       3.2 <NA>
       4       0.1    E
       4       0.1    F

score 4 · Answer 3 · answered Mar 06 '19 at 07:03

using stack

df=df.join(df.pop('genelist').str.split(',',expand=True))
df.set_index(['region_id','statistic']).stack().reset_index(level=[0,1],name='gene')

using melt

df=df.join(df.pop('genelist').str.split(',',expand=True))
pd.melt(df,id_vars=['region_id','statistic'],value_name='gene').dropna()

score 1 · Answer 4 · answered May 23 '20 at 17:46

You can also use df.assign and explode to do it. Explode is used to break columns data into multiple rows given that the data is in a list format. Each data in genelist can be converted to a list, separating it using a comma and then use explode on the genelist column.

(df.assign(genelist=df.genelist.str.split(',')).explode('genelist'))

Python melt dataframe based on values of comma-separated character vector column

4 Answers4

Linked