Split multiple columns of lists into separate rows

Question

I have a dataframe like this -

df = pd.DataFrame(
    {'key': [1, 2, 3, 4],
     'col1': [['apple','orange'], ['pineapple'], ['','','guava','',''], ['','','orange','apple','']],
     'col2': [['087','799'], ['681'], ['078'], ['816','018']]
     }
)

#   key                   col1        col2
#0    1        [apple, orange]  [087, 799]
#1    2            [pineapple]       [681]
#2    3        [, , guava, , ]       [078]
#3    4  [, , orange, apple, ]  [816, 018]

I need to split the columns 'col1' and 'col2' and create separate rows, but map the list elements according to their indices. The desired output is this -

desired_df = pd.DataFrame(
    {'key': [1, 1, 2, 3, 4, 4],
     'col1': [['apple'],['orange'],['pineapple'], ['guava'], ['orange'],['apple']],
     'col2': [['087'],['799'], ['681'], ['078'], ['816'],['018']]
    }
)

col1 may contain blank elements, but the number of non-empty elements in each col1 list always matches the length of the corresponding col2 list. See rows 2 and 3 of df for examples.
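
The length assumption described above can be checked before reshaping. This is only a minimal sketch: the names `non_empty_counts` and `matches` are made up for illustration and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4],
    'col1': [['apple', 'orange'], ['pineapple'],
             ['', '', 'guava', '', ''], ['', '', 'orange', 'apple', '']],
    'col2': [['087', '799'], ['681'], ['078'], ['816', '018']],
})

# Count the non-blank strings in each col1 list (hypothetical helper name)
non_empty_counts = df['col1'].apply(lambda items: sum(1 for x in items if x != ''))

# Compare against the length of each col2 list, row by row
matches = non_empty_counts == df['col2'].str.len()
print(matches.all())
```

If any row fails this check, the index-based mapping between col1 and col2 is no longer well defined for that row.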

I tried the following, but it did not work -

df.set_index(['key'])[['col1','col2']].apply(pd.Series).stack().reset_index(level=1, drop=True)

Do you really want the output to have every single value as a list with a single element? or just the plain scalar value? — ALollz, Aug 25 '20 at 16:20
just that they need to be in the same order as the input, ignoring the spaces — rbc-2019, Aug 25 '20 at 16:23

ALollz · Answer 1 · 2020-08-25T16:54:43.287

Since you know that the number of non-empty elements in each list will always match, you can explode each column separately, filter out the blanks, and join the results back. Add on a .reset_index() if you want 'key' back as a column.

import pandas as pd

pd.concat([df.set_index('key')[[col]].explode(col).query(f'{col} != ""')
           for col in ['col1', 'col2']], axis=1)

# Without the f-string
#pd.concat([df.set_index('key')[[col]].explode(col).query(col + ' != ""')
#           for col in ['col1', 'col2']], axis=1)

          col1 col2
key                
1        apple  087
1       orange  799
2    pineapple  681
3        guava  078
4       orange  816
4        apple  018
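
For reference, here is the same approach as a self-contained sketch, with the `.reset_index()` mentioned above tacked on so that 'key' comes back as a column. It assumes pandas 0.25 or later (for `explode`) and Python 3.6 or later (for the f-string); the names `parts` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4],
    'col1': [['apple', 'orange'], ['pineapple'],
             ['', '', 'guava', '', ''], ['', '', 'orange', 'apple', '']],
    'col2': [['087', '799'], ['681'], ['078'], ['816', '018']],
})

# Explode each list column on its own and drop the blank strings
parts = [
    df.set_index('key')[[col]].explode(col).query(f'{col} != ""')
    for col in ['col1', 'col2']
]

# Both parts end up with the same 'key' index, so they line up side by side
result = pd.concat(parts, axis=1).reset_index()
print(result)
```

The side-by-side concat relies on both filtered frames having identical indexes, which holds only because the non-empty counts match in every row.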

If you are using an older version of pandas that doesn't have the explode method, use @BEN_YO's method to unnest. I'll copy the relevant code over here since there are a few different versions to choose from.

import numpy as np

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, axis=1), how='left')

pd.concat([unnesting(df.set_index('key')[[col]], explode=[col]).query(f'{col} !=""')
           for col in ['col1', 'col2']], axis=1)
# Same output as above

thank you! this is giving an error. i am using python 2.7. could that be the reason? is there an alternative that works on python 2.7? — rbc-2019, Aug 25 '20 at 16:33
@rbc-2019 Oh yes that would be an issue. `explode` is "new" as of a few versions ago, 0.25 I think, but that probably doesn't support 2.7. Give me a minute to update with a method someone else created to "explode" prior to that method — ALollz, Aug 25 '20 at 16:35
the issue isnt with the explode method, but with the f string part. I am getting an invalid syntax error — rbc-2019, Aug 25 '20 at 16:47
@rbc-2019 oooh yeah I should have realized that. See the update: just create the string yourself (ideally column names are strings, else you can `str()` them). — ALollz, Aug 25 '20 at 16:55
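
If building the query string is the sticking point, one possible alternative is to skip `query` entirely and filter with a boolean mask. This is a sketch rather than part of the answer above; it still assumes `Series.explode` is available (pandas 0.25 or later), and the names `indexed`, `parts` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4],
    'col1': [['apple', 'orange'], ['pineapple'],
             ['', '', 'guava', '', ''], ['', '', 'orange', 'apple', '']],
    'col2': [['087', '799'], ['681'], ['078'], ['816', '018']],
})

indexed = df.set_index('key')

parts = []
for col in ['col1', 'col2']:
    exploded = indexed[col].explode()        # one row per list element
    parts.append(exploded[exploded != ''])   # keep only the non-blank values

# Each Series keeps its column name, so concat yields columns col1 and col2
result = pd.concat(parts, axis=1)
print(result)
```

Because no string expression is parsed, this version does not care whether the column names are valid Python identifiers.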

Suryaveer Singh · Answer 2 · 2020-08-25T17:28:53.673

Try building a new DataFrame from the old one like this:

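# Repeat each key once per element of col2 so it lines up with the flattened lists
df['key'] = df.apply(lambda x: [x['key']] * len(x['col2']), axis=1)
lst_col = ['key', 'col1', 'col2']
# Flatten every column, skipping the blank strings
df = pd.DataFrame({
    col: [x for lst in df[col] for x in lst if x != ""] for col in lst_col
})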

Output

   key       col1 col2
0    1      apple  087
1    1     orange  799
2    2  pineapple  681
3    3      guava  078
4    4     orange  816
5    4      apple  018

edited Aug 25 '20 at 17:28

answered Aug 25 '20 at 16:36

thank you! but I am losing the 'key' column by doing this. how can i retain that? – rbc-2019 Aug 25 '20 at 16:42
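
A variant of the same idea that keeps 'key' and leaves the original df untouched could look like the sketch below. It assumes, as the question states, that the non-empty count in col1 equals the length of col2 in every row; the names `keys` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4],
    'col1': [['apple', 'orange'], ['pineapple'],
             ['', '', 'guava', '', ''], ['', '', 'orange', 'apple', '']],
    'col2': [['087', '799'], ['681'], ['078'], ['816', '018']],
})

# Repeat each key once per element of the matching col2 list
keys = df['key'].repeat(df['col2'].str.len()).tolist()

# Flatten the list columns, skipping blank strings
result = pd.DataFrame({
    'key': keys,
    'col1': [x for lst in df['col1'] for x in lst if x != ''],
    'col2': [x for lst in df['col2'] for x in lst if x != ''],
})
print(result)
```

If the length assumption is violated, the DataFrame constructor raises a ValueError because the columns have different lengths.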

score 0 · Answer 3 · answered Aug 25 '20 at 17:46

For the sake of complexity :)

pd.DataFrame([j for i in [[{"key": x['key'],"col1": y,'col2':x['col2'][list(filter(None, x['col1'])).index(y)]} for y in list(filter(None, x['col1']))]for idx, x in df.iterrows()] for j in i])

Output

|   key | col1      |   col2 |
|------:|:----------|-------:|
|     1 | apple     |    087 |
|     1 | orange    |    799 |
|     2 | pineapple |    681 |
|     3 | guava     |    078 |
|     4 | orange    |    816 |
|     4 | apple     |    018 |
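
The one-liner above looks up each col2 value with `.index(y)`, which returns the first match, so a col1 list containing the same fruit twice would map both occurrences to the same col2 value. A more readable sketch of the same row-by-row idea pairs the values with `zip` instead. The names `rows`, `non_empty` and `result` are illustrative and not from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4],
    'col1': [['apple', 'orange'], ['pineapple'],
             ['', '', 'guava', '', ''], ['', '', 'orange', 'apple', '']],
    'col2': [['087', '799'], ['681'], ['078'], ['816', '018']],
})

rows = []
for _, row in df.iterrows():
    # Drop blank strings, then pair what is left with col2 by position
    non_empty = [value for value in row['col1'] if value]
    for c1, c2 in zip(non_empty, row['col2']):
        rows.append({'key': row['key'], 'col1': c1, 'col2': c2})

result = pd.DataFrame(rows)
print(result)
```

Note that `zip` stops at the shorter of the two lists, so a row that breaks the length assumption is truncated silently instead of raising an error.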

score 0 · Answer 4 · answered Aug 25 '20 at 18:04

Try this:

import itertools

# Repeat each key once per element of col2
newkeys = list(itertools.chain.from_iterable(df.apply(lambda vals: [vals['key']] * len(vals['col2']), axis=1).tolist()))
newcol1, newcol2 = list(itertools.chain.from_iterable(df.col1)), list(itertools.chain.from_iterable(df.col2))
# Drop the blank strings from col1
newcol1 = list(filter(None, newcol1))
pd.DataFrame(list(zip(newkeys, newcol1, newcol2)), columns=df.columns)
