I have a pandas dataframe of the form:
index | id | group
0 | abc | A
1 | abc | B
2 | abc | B
3 | abc | C
4 | def | A
5 | def | B
6 | ghi | B
7 | ghi | C
I would like to transform this to a weighted graph / adjacency matrix where nodes are the 'group', and the weights are the sum of shared ids per group pair:
The weights are the count of the group pair combinations per id, so:
AB = 'abc' indexes (0,1),(0,2) + 'def' indexes (4,5) = 3
AC = 'abc' (0,3) = 1
BC = 'abc' (2,3), (1,3) + 'ghi' (6,7) = 3
and the resulting matrix would be:
A |B |C
A| 0 |3 |1
B| 3 |0 |3
C| 1 |3 |0
At the moment I am doing this very inefficiently by:
f = df.groupby(['id']).agg({'group':pd.Series.nunique}) # to count groups per id
f.loc[f['group']>1] # to get a list of the ids with >1 group
# i then for loop through the id's getting the count of values per pair (takes a long time).
This is a first pass crude hack approach, I'm sure there must be an alternative approach using groupby or crosstab but I cant figure it out.