Let's say my df is:
index "A" "B"
0 A1 "B1,B2,B3"
1 A2 "B2,B4,B3"
2 A3 "B2,B3,B5"
and I want to do magical_function(df)
index "B'" "A''"
0 B1 "A1"
1 B2 "A1,A2,A3"
2 B3 "A1,A2,A3"
3 B4 "A2"
4 B5 "A3"
So I used an exploding strategy (all merit to posters here: pandas: How do I split text in a column into multiple rows?)
I thus dropna first to avoid mistakes, then I make a Series with the column to split, I explode it and stack it, and then the join magic with the same index duplicate the "A" columns values where needed
dcolumn="A"
col="B"
current_wdf=df[[idcolumn,col]].dropna()
current_col=current_wdf.loc[:,col]
exploded_df=current_col.str.split(',').apply(pd.Series,1).stack()#much slower but keep the index. I could used substitution with enumerate after dropping level
exploded_df.index=exploded_df.index.droplevel(-1)
exploded_df.name=col
agg_df=pd.DataFrame(current_wdf.loc[:,idcolumn]).join(exploded_df)
grouped=agg_df.groupby([col])
After what I have:
0 1
0 B1 A1
1 B2 A1
1 B2 A2
1 B2 A3
2 B3 A1
2 B3 A2
2 B3 A3
3 B4 A2
4 B5 A3
Then I do
grouped=agg_df.groupby([col])
gives
a dict
{'B1': Int64Index([0], dtype='int64'),
'B2': Int64Index([1, 1, 1], dtype='int64'),
'B3': Int64Index([2, 2, 2], dtype='int64'),
'B4': Int64Index([3], dtype='int64'),
'B5': Int64Index([4], dtype='int64')}
To have the dataframe I desire, I need to get past the "groups" limitation of only displaying the indexes and doing this
groups_dict= {k: list(grouped.get_group(v).loc[:,idcolumn]) for k, v in grouped.groups.items()}
or
agg_df2=agg_df.reset_index()
groups_dict2= {k: list(agg_df2.loc[v,idcolumn]) for k,v in grouped.indices.items()}
And I finally have the dataframe but both are feeling slow.
But that was less than trivial, and I am dubious about the last part. It works, but it is slow, and could easily break.
Are there no operation for such matching reversal process? And, for groups content retrieval such as I exposed, are there any way to grab the contents instead of the indexes without have to perform repeatedly get_group?