1

I have a dataframe df as

df
       col1 act_id col2                                                                                                 
   --------------------
0  40;30;30   act1 A;B;C
1  25;50;25   act2 D;E;F
2     70;30   act3 G;H

I want to break each record in such a way that values in column col1 and col2 explode into multiple rows but such that the first value in col1 after splitting upon ';' corresponds to the first value in col2 after splitting upon ';'. So my desired_df should look like this:

desired_df
       col1 act_id col2                                                                                                 
       ---------------
    0  40   act1   A
    1  30   act1   B
    2  30   act1   C
    3  25   act2   D
    4  50   act2   E
    5  25   act2   F                                                                                                  
    6  70   act3   G                                                                              
    7  30   act3   H                                                                               

NOTE: this is not the same as Split (explode) pandas dataframe string entry to separate rows as here the exploding/splitting of one record is not just across one column but the need is to split or explode one row into multiple rows, in two columns simultaneously.

Any help is appreciated. Thanks

Aman Singh
  • 1,111
  • 3
  • 17
  • 31
  • you can use `unnesting(df,['col1','col2'])` from here: https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe/53218939#53218939 after splitting the cols by `;` like `df.col1=df.col1.str.split(';')` and same for col2 – anky Jun 04 '19 at 04:41

3 Answers3

1

one way to do this

df2.set_index('act_id').apply(lambda x: pd.Series(x.col1.split(';'),x.col2.split(';')), axis=1).stack().dropna().reset_index()

df2.columns = ['col1','act_id','col2']

  col1 act_id col2
0  A    act1   40 
1  B    act1   30 
2  C    act1   30 
3  D    act2   25 
4  E    act2   50 
5  F    act2   25 
6  G    act3   70 
7  H    act3   30 
iamklaus
  • 3,720
  • 2
  • 12
  • 21
  • this solution worked for this particular df, although when i tried running this on a bigger dataframe with about 1M rows, i got the error as `ValueError: cannot reindex from a duplicate axis` I don't understand because even the bigger original dataframe had just these columns. – Aman Singh Jun 04 '19 at 05:28
  • try the updated..problem might be related to duplicated index values, if the update doesn't works then try breaking the code and running it step by step, will help u debug the problem (do post on which step u faced the error) – iamklaus Jun 04 '19 at 05:45
  • I tried to break it down step by step and I found that the error occurs at the `set_index('act_id')` method. I removed all other methods and still got the error as `ValueError: cannot reindex from a duplicate axis` Is there any other way to get to the desired result...? – Aman Singh Jun 04 '19 at 06:00
  • i still end up having the error `ValueError: cannot reindex from a duplicate axis` – Aman Singh Jun 04 '19 at 09:19
0

the idea is col1 and col2 should be exploded and then merge on index and join back to the original dataframe.

df1 = df.col1.str.split(";").apply(pd.Series).stack().droplevel(1).reset_index()
df2 = df.col2.str.split(";").apply(pd.Series).stack().droplevel(1).reset_index()
df12 = pd.merge(df1, df2[0], left_index=True, right_index=True)
df12.columns = ["index", "col1", "col2"]

pd.merge(df12, df["act_id"], left_on="index", right_index=True)
Dyno Fu
  • 8,753
  • 4
  • 39
  • 64
0

A generic function could be:

list_cols = {'col1','col2'}
other_cols = list(set(df.columns) - set(list_cols))
exploded = [df[col].explode() for col in list_cols]
desired_df = pd.DataFrame(dict(zip(list_cols, exploded)))
desired_df = df[other_cols].merge(desired_df, how="right", left_index=True, right_index=True)

Please strsplit the col 1 and 2 before calling above function

Allohvk
  • 915
  • 8
  • 14