1

I have a sample DataFrame as follows:

track_id track_date status status_info
track_1 2021-01-01 approved None
track_2 2021-01-02 None accredited
track_3 2021-01-03 approved accredited
track_4 2021-01-04 approved approved
track_5 2021-01-05 approved|approved accredited|cancelled
track_6 2021-01-06 None accredited|cancelled

And I need to split both status and status_info into rows, so it gives me an output similar to the one below:

track_id track_date status status_info
track_1 2021-01-01 approved None
track_2 2021-01-02 None accredited
track_3 2021-01-03 approved accredited
track_4 2021-01-04 approved approved
track_5 2021-01-05 approved accredited
track_5 2021-01-05 approved cancelled
track_6 2021-01-06 None accredited
track_6 2021-01-06 None cancelled

I have already tried the code below, using this answer in another question as reference:

# splitting string values into lists
new_status = df['status'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)
new_status_info = df['status_info'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)

# generating a temporary DataFrame to join later (error here)
df_split = pd.concat([new_status, new_status_info], axis=1, keys=['status', 'status_info'])

# then, we join both DataFrames
df.drop(columns=['status','status_info'], axis=1).join(df_split).reset_index(drop=True)

But it gives me a ValueError:

ValueError: cannot reindex from a duplicate axis

When I modify .reset_index(level=1, drop=True) to .reset_index(drop=True) on the split step, the join operation only brings me one of the values, and not the expected two:

track_id track_date status status_info
track_1 2021-01-01 approved None
track_2 2021-01-02 None accredited
track_3 2021-01-03 approved accredited
track_4 2021-01-04 approved approved
track_5 2021-01-05 approved cancelled
track_6 2021-01-06 None accredited
njbln
  • 17
  • 7

2 Answers2

2

You can melt, explode, pivot:

cols = ['track_id', 'track_date']

(df.melt(cols, ignore_index=False).reset_index()
   .assign(value=lambda d: d['value'].str.split('|'))
   .explode('value')
   .assign(n=lambda d: d.groupby(level=0).cumcount())
   .pivot(index=cols+['index', 'n'], columns='variable', values='value')
   .reset_index(cols).droplevel('n')
   .groupby(level=0).ffill()
   .rename_axis(index=df.index.name, columns=df.columns.name)
)

Output:

  track_id  track_date    status status_info
0  track_1  2021-01-01  approved        None
1  track_2  2021-01-02      None  accredited
2  track_3  2021-01-03  approved  accredited
3  track_4  2021-01-04  approved    approved
4  track_5  2021-01-05  approved  accredited
4  track_5  2021-01-05  approved   cancelled
5  track_6  2021-01-06      None  accredited
5  track_6  2021-01-06      None   cancelled
mozway
  • 194,879
  • 13
  • 39
  • 75
1

You can try to use itertools.zip_longest:

from itertools import zip_longest

df['status'] = df['status'].str.split('|')
df['status_info'] = df['status_info'].str.split('|')

# make `status`, `status_info` columns the same length
df[['status', 'status_info']] = df[['status', 'status_info']].apply(lambda x: [*zip(*zip_longest(x['status'], x['status_info']))], axis=1, result_type='expand')
print(df.explode(['status', 'status_info']))

Prints:

  track_id  track_date    status status_info
0  track_1  2021-01-01  approved        None
1  track_2  2021-01-02      None  accredited
2  track_3  2021-01-03  approved  accredited
3  track_4  2021-01-04  approved    approved
4  track_5  2021-01-05  approved  accredited
4  track_5  2021-01-05  approved   cancelled
5  track_6  2021-01-06      None  accredited
5  track_6  2021-01-06      None   cancelled
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91