Reverse grouping relationship between table/df columns

Question

Let's say my df is:

index "A" "B"
0 A1 "B1,B2,B3"
1 A2 "B2,B4,B3"
2 A3 "B2,B3,B5"

and I want to do magical_function(df)

index "B'" "A''"
0 B1 "A1"
1 B2 "A1,A2,A3"
2 B3 "A1,A2,A3"
3 B4 "A2"
4 B5 "A3"

So I used an exploding strategy (all merit to posters here: pandas: How do I split text in a column into multiple rows?)
I thus dropna first to avoid mistakes, then I make a Series with the column to split, I explode it and stack it, and then the join magic with the same index duplicate the "A" columns values where needed

dcolumn="A"
col="B"
current_wdf=df[[idcolumn,col]].dropna()
current_col=current_wdf.loc[:,col]
exploded_df=current_col.str.split(',').apply(pd.Series,1).stack()#much slower but keep the index. I could used substitution with enumerate after dropping level
exploded_df.index=exploded_df.index.droplevel(-1)
exploded_df.name=col
agg_df=pd.DataFrame(current_wdf.loc[:,idcolumn]).join(exploded_df)
grouped=agg_df.groupby([col])

After what I have:

0   1

0 B1 A1
1 B2 A1
1 B2 A2
1 B2 A3
2 B3 A1
2 B3 A2
2 B3 A3
3 B4 A2
4 B5 A3

Then I do

grouped=agg_df.groupby([col])
gives
a dict
{'B1': Int64Index([0], dtype='int64'),
'B2': Int64Index([1, 1, 1], dtype='int64'),
'B3': Int64Index([2, 2, 2], dtype='int64'),
'B4': Int64Index([3], dtype='int64'),
'B5': Int64Index([4], dtype='int64')}

To have the dataframe I desire, I need to get past the "groups" limitation of only displaying the indexes and doing this

groups_dict= {k: list(grouped.get_group(v).loc[:,idcolumn]) for k, v in grouped.groups.items()}

or

agg_df2=agg_df.reset_index()
groups_dict2= {k: list(agg_df2.loc[v,idcolumn]) for k,v in grouped.indices.items()}

And I finally have the dataframe but both are feeling slow.

But that was less than trivial, and I am dubious about the last part. It works, but it is slow, and could easily break.

Are there no operation for such matching reversal process? And, for groups content retrieval such as I exposed, are there any way to grab the contents instead of the indexes without have to perform repeatedly get_group?

jezrael · Accepted Answer · 2017-05-31T14:43:58.927

1

Use set_index + str.split + stack + groupby + apply + reset_index :

df = df.set_index('A')['B']
       .str.split(',', expand=True)
       .stack()
       .reset_index(name='B')
       .groupby('B')['A']
       .apply(', '.join)
       .reset_index()
print (df)
    B           A
0  B1          A1
1  B2  A1, A2, A3
2  B3  A1, A2, A3
3  B4          A2
4  B5          A3

Another solution with numpy.concatenate + numpy.repeat + DataFrame constructor:

s = df.set_index('A')['B'].str.split(',')
l = s.str.len()
df1 = pd.DataFrame({'A': np.repeat(df['A'].values, l), 'B':np.concatenate(s)})
df1 = df1.groupby('B')['A'].apply(', '.join).reset_index()
print (df1)
    B           A
0  B1          A1
1  B2  A1, A2, A3
2  B3  A1, A2, A3
3  B4          A2
4  B5          A3

edited May 31 '17 at 14:43

answered May 31 '17 at 12:22

jezrael

822,522
95
1,334
1,252

Well; I am sorry, there must be something missing, it either yields : index "A" "B" 0 A1 "B1,B2,B3" 1 A2 "B2,B4,B3" 2 A3 "B2,B3,B5" – Ando Jurai May 31 '17 at 13:50
1

Super, and sorry again. ;) Good luck! – jezrael May 31 '17 at 14:48

Reverse grouping relationship between table/df columns

1 Answers1