I have a pandas DataFrame that looks like this:

label   pred              gt
label1  val1              val11
label2  ['str1', 'str2']  ['str1', 'str3', 'str4']
label3  foo               box

I want to convert the label2 row, whose cells hold either a list of strings or None, into multiple rows (one row per list element when the cell is a list):

label   pred    gt
label1  val1    val11
label2  'str1'  'str1'
label2  'str2'  'str3'
label2  None    'str4'
label3  foo     box

I tried explode() for this, but the resulting DataFrame contains only NaN values and the exploded rows are not matched to the right label. Here is my code:

df_filtered = output_df[output_df['label'] == 'label2']

# explode the list column into multiple rows while keeping other columns
df_exploded = pd.concat([
    df_filtered.drop(['pred', 'gt'], axis=1),
    df_filtered['pred'].explode().reset_index(drop=True),
    df_filtered['gt'].explode().reset_index(drop=True)
], axis=1)

# add prefix to the existing column name (label) to differentiate each new row
df_exploded = df_exploded.add_prefix('new_')

# rename the columns to remove the prefix from the original column
df_exploded = df_exploded.rename(columns={'new_pred': 'pred', 'new_gt': 'gt'})

# combine the exploded dataframe with the original dataframe, dropping the original list column
df_combined = pd.concat([output_df.drop(['pred', 'gt'], axis=1), df_exploded], axis=1)

Any help would be appreciated.

Yana

1 Answer

You can explode each column independently, number the exploded values within each original row, and then concat the pieces so that they align on that position:

cols = ['pred', 'gt']

others = df.columns.difference(cols)
out = pd.concat([df.explode(c)[others.union([c])]
                   .assign(n=lambda d: d.groupby(level=0).cumcount())
                   .set_index(['n']+list(others), append=True)
                 for c in cols], axis=1
               ).sort_index(level=[0, 1]).droplevel(1).reset_index(others)

print(out)

Output:

    label  pred     gt
0  label1  val1  val11
1  label2  str1   str1
1  label2  str2   str3
1  label2   NaN   str4
2  label3   foo    box
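For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The sample DataFrame is reconstructed from the question (the exact mix of scalars and lists is an assumption), and the function name `explode_unaligned` is hypothetical, chosen only for illustration.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed: scalars in most rows,
# Python lists of different lengths in the label2 row).
df = pd.DataFrame({
    "label": ["label1", "label2", "label3"],
    "pred": ["val1", ["str1", "str2"], "foo"],
    "gt": ["val11", ["str1", "str3", "str4"], "box"],
})


def explode_unaligned(df, cols):
    """Explode each column in `cols` separately and align the pieces by position.

    Hypothetical helper: it assumes the index of `df` has no duplicates.
    """
    others = df.columns.difference(cols)  # columns to repeat, here just 'label'
    parts = []
    for c in cols:
        # Explode one column and keep only it plus the repeated columns.
        part = df.explode(c)[others.union([c])]
        # n = position of each exploded value within its original row.
        part = part.assign(n=part.groupby(level=0).cumcount())
        # Index on (original row, n, other columns) so concat can align on it.
        parts.append(part.set_index(["n"] + list(others), append=True))
    # Outer join on the index: shorter lists are padded with NaN.
    out = pd.concat(parts, axis=1)
    return out.sort_index(level=[0, 1]).droplevel(1).reset_index(list(others))


out = explode_unaligned(df, ["pred", "gt"])
print(out)
```

Aligning on the `(original row, n)` pair is what lets lists of different lengths sit side by side; the shorter list simply has no entry for the last positions.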
mozway
  • I get this error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects – Yana Mar 29 '23 at 07:14
  • Your original index must not contain duplicates (`df = df.reset_index(drop=True)`). If it does, prepend a unique index to it and discard it at the end ;) – mozway Mar 29 '23 at 07:28
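
A rough sketch of the two options from the comment above, assuming a frame whose index has repeated labels. The sample data, the helper name `explode_unaligned`, and the column name `orig_index` are all hypothetical. Note that the second option is a variation on the suggestion: rather than prepending an index level, it parks the old index in a column, which the explode step then carries along like any other non-exploded column.

```python
import pandas as pd

# Hypothetical frame with duplicate index labels (e.g. left over from a concat).
df = pd.DataFrame(
    {
        "label": ["label1", "label2", "label3"],
        "pred": ["val1", ["str1", "str2"], "foo"],
        "gt": ["val11", ["str1", "str3", "str4"], "box"],
    },
    index=[7, 7, 9],
)


def explode_unaligned(df, cols):
    """Same approach as in the answer; requires a unique index on `df`."""
    others = df.columns.difference(cols)
    parts = []
    for c in cols:
        part = df.explode(c)[others.union([c])]
        part = part.assign(n=part.groupby(level=0).cumcount())
        parts.append(part.set_index(["n"] + list(others), append=True))
    out = pd.concat(parts, axis=1)
    return out.sort_index(level=[0, 1]).droplevel(1).reset_index(list(others))


# Option 1: the old index is not needed, so replace it with 0..n-1.
df_unique = df.reset_index(drop=True)
out_simple = explode_unaligned(df_unique, ["pred", "gt"])

# Option 2: keep the old labels in a helper column, work on a unique
# RangeIndex, then restore the old labels as the index at the end.
df_keyed = df.rename_axis("orig_index").reset_index()
out_keyed = explode_unaligned(df_keyed, ["pred", "gt"])
restored = out_keyed.set_index("orig_index").rename_axis(None)

print(restored)
```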