I have an issue similar to this one, with a few differences and complications.

I have a list of groups containing members. Rather than merging the groups that share members, I need to preserve the groupings and create a new set of edges based on which groups have members in common, and do so conditionally based on attributes of the groups.

The source data looks like this:

+----------+------------+-----------+
| Group ID | Group Type | Member ID |
+----------+------------+-----------+
| A        | Type 1     |         1 |
| A        | Type 1     |         2 |
| B        | Type 1     |         2 |
| B        | Type 1     |         3 |
| C        | Type 1     |         3 |
| C        | Type 1     |         4 |
| D        | Type 2     |         4 |
| D        | Type 2     |         5 |
+----------+------------+-----------+

Desired output is this:

+----------+-----------------+
| Group ID | Linked Group ID |
+----------+-----------------+
| A        | B               |
| B        | C               |
+----------+-----------------+

A is linked to B because they share member 2. B is linked to C because they share member 3. C is not linked to D: they have a member in common, but the groups are of different types.

The number of shared members doesn't matter for my purposes; a single member in common means the groups are linked.

The output is being used as the edges of a graph, so if the output is a graph that fits the rules, that's fine.

The source dataset is large (hundreds of millions of rows), so performance is a consideration.

This poses a similar question; however, I'm new to Python and can't figure out how to get the source data to a point where I can use the answer, or how to work in the additional requirement that the group types match.

  • Hi, welcome to SO. The similar question you suggested uses a different concept, connected components. According to your requirements, B should not be connected with E (unlike in the suggested answer), should it? Also, are Group Type and Member ID sorted? – mathfux Aug 25 '20 at 18:52
  • Thank you. I've spent lots of time here getting tips, but I'm really stumped this time! There's no E in the example, and there would be no pattern, if that's what you mean. Group Type and Member ID could be sorted at source if that would help. – SquirreledHogs Aug 25 '20 at 19:24
  • I mean that one of the solutions in the links you mentioned is a good example of why your problem is not equivalent to finding connected components (especially this one: https://stackoverflow.com/questions/46200969/how-to-group-all-labels-index-which-shares-at-least-one-1-in-the-same-column?noredirect=1&lq=1). In one of the answers, components `[A, C]` and `[B, D, E]` are found, but `B` and `E` are not a pair you would want to get. – mathfux Aug 25 '20 at 20:06

1 Answer

Try something like this:

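# For each (group type, member) pair, join the IDs of the groups that contain it
df1 = df.groupby(['Group Type', 'Member ID'])['Group ID'].apply(','.join).reset_index()
# Keep only members that appear in more than one group of the same type
df2 = df1[df1['Group ID'].str.contains(',')]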

This might not handle the case of cyclic grouping.
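The snippet above yields comma-joined group IDs per shared member rather than the two-column edge list shown in the question. One possible way to get that edge list is a self-merge on group type and member ID; this is only a sketch under assumptions, not a tested solution for the full dataset. The DataFrame `df` is rebuilt here from the sample table, and the intermediate names `pairs` and `edges` and the `_linked` suffix are illustrative choices, not part of the original post.

```python
import pandas as pd

# Sample data copied from the question's source table.
df = pd.DataFrame({
    "Group ID":   ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Group Type": ["Type 1"] * 6 + ["Type 2"] * 2,
    "Member ID":  [1, 2, 2, 3, 3, 4, 4, 5],
})

# Self-merge: pair every two rows that share both a group type and a member.
pairs = df.merge(df, on=["Group Type", "Member ID"], suffixes=("", "_linked"))

# Keep each unordered pair once (left ID < right ID also drops self-pairs),
# then collapse pairs that share more than one member into a single edge.
edges = (
    pairs.loc[pairs["Group ID"] < pairs["Group ID_linked"],
              ["Group ID", "Group ID_linked"]]
    .drop_duplicates()
    .rename(columns={"Group ID_linked": "Linked Group ID"})
    .sort_values(["Group ID", "Linked Group ID"])
    .reset_index(drop=True)
)

print(edges)
```

On the sample data this gives the edges A-B and B-C, and C-D is excluded because the group types differ. Note that a member belonging to k groups of the same type produces k × k rows in the intermediate merge, so with hundreds of millions of rows this may need to be run in chunks or on an out-of-core engine.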

Sushant