I have a pandas dataframe that has combinations of two id columns such as this:

ID1 ID2
A B
A C
A D
A E
B C
B D
B E
C D
C E
D E
F H
I K
K J
G F
G H
I J
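
For anyone who wants to reproduce this, the table above can be rebuilt as a dataframe roughly like this. It is a minimal sketch: the name `pairs` is only an illustration, and the row order is assumed to match the listing so that the index values line up with the outputs shown in the answers.

```python
import pandas as pd

# Pairs copied from the table above, in the same row order.
pairs = ["AB", "AC", "AD", "AE", "BC", "BD", "BE", "CD", "CE", "DE",
         "FH", "IK", "KJ", "GF", "GH", "IJ"]

# Each two-letter string becomes one row: first letter -> ID1, second -> ID2.
df = pd.DataFrame([list(p) for p in pairs], columns=["ID1", "ID2"])
```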

Here we have the choose-2 combinations for the sets ABCDE, FGH, and IJK.

I would like to keep only the rows whose ID1 value occurs most often within its set. For ABCDE this is A, for FGH it is G, and for IJK it is I, which gives the following:

ID1 ID2
A B
A C
A D
A E
I K
G F
G H
I J
Bill K

2 Answers

Count how often each value occurs in ID1, then, inside a list comprehension, find for each set the value with the highest count, and finally use these values to filter the rows of the dataframe:

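# Number of rows per ID1 value
c = df['ID1'].value_counts()
# For each known set, the member with the highest count
# (members that never occur in ID1 become NaN and are skipped by idxmax)
i = [c.reindex([*s]).idxmax() for s in ['ABCDE', 'FGH', 'IJK']]

# Keep only the rows whose ID1 is one of those members
df[df['ID1'].isin(i)]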

   ID1 ID2
0    A   B
1    A   C
2    A   D
3    A   E
11   I   K
13   G   F
14   G   H
15   I   J
Shubham Sharma
  • Nice one ;) but you need to have prior knowledge of the groups – mozway Feb 16 '22 at 14:53
  • Yes, I should have mentioned that there is no prior knowledge of the groups. This is similar to what I was trying, but I couldn't get past that. – Bill K Feb 16 '22 at 15:07
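
If the groups are not known up front, one possible way to keep the counting approach is to derive them first. The sketch below is not part of the answer above: it assumes `networkx` is installed, treats the pairs as an undirected graph, and takes each connected component as one group. The names `pairs`, `groups`, `keep` and `result` are only illustrations.

```python
import networkx as nx
import pandas as pd

# Question data, rebuilt so the example is self-contained.
pairs = ["AB", "AC", "AD", "AE", "BC", "BD", "BE", "CD", "CE", "DE",
         "FH", "IK", "KJ", "GF", "GH", "IJ"]
df = pd.DataFrame([list(p) for p in pairs], columns=["ID1", "ID2"])

# Undirected graph: every connected component is one group of IDs.
G = nx.from_pandas_edgelist(df, source="ID1", target="ID2")
groups = list(nx.connected_components(G))

# Number of rows per ID1 value.
c = df["ID1"].value_counts()

# Per group, the member with the highest count; members that never
# occur in ID1 get a count of 0 via fill_value.
keep = [c.reindex(sorted(g), fill_value=0).idxmax() for g in groups]

result = df[df["ID1"].isin(keep)]
```

With the question's data this selects A, G and I, the same values as the hard-coded list of sets.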

Assuming you don't know in advance the groups, this could be approached as a graph problem using networkx.

You have the following graph:

[image: directed graph of the ID1 → ID2 edges]

What you need is to find the root of each cluster, that is, the node with no incoming edges (see here for the method to find roots).

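import networkx as nx

# Directed graph with one edge ID1 -> ID2 per row
G = nx.from_pandas_edgelist(df, source='ID1', target='ID2',
                            create_using=nx.DiGraph)

# Roots are the nodes with no incoming edge
roots = [n for n, d in G.in_degree() if d == 0]

# Keep only the rows that start at a root
df2 = df[df['ID1'].isin(roots)]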

output:

   ID1 ID2
0    A   B
1    A   C
2    A   D
3    A   E
11   I   K
13   G   F
14   G   H
15   I   J
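
The same in-degree test can also be written without `networkx`, as a rough sketch: a node has no incoming edge exactly when it never appears in ID2. This assumes the same `df` as in the question (rebuilt here so the snippet runs on its own). The name `is_root` is only an illustration.

```python
import pandas as pd

# Question data, rebuilt so the example is self-contained.
pairs = ["AB", "AC", "AD", "AE", "BC", "BD", "BE", "CD", "CE", "DE",
         "FH", "IK", "KJ", "GF", "GH", "IJ"]
df = pd.DataFrame([list(p) for p in pairs], columns=["ID1", "ID2"])

# A root never appears as a target, so its ID1 value is absent from ID2.
is_root = ~df["ID1"].isin(df["ID2"])

df2 = df[is_root]
```

On this data it keeps the same rows as the graph version (index 0 to 3, 11, 13, 14 and 15).
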
Sunderam Dubey
mozway
  • Is the root always the one with the most children, or can there be graphs where the root is not the one with the most children? – ansev Feb 16 '22 at 15:06
  • @ansev as it's the root, I guess it is the parent of all, so it has the most children (per cluster), no? – mozway Feb 16 '22 at 15:07
  • I have never heard of this package, thanks for pointing me in the direction. – Bill K Feb 16 '22 at 15:09
  • The point here is to know if they have to be children or children / grandchildren / ... :) But nice as usual:) @mozway – ansev Feb 16 '22 at 15:11
  • What is nice with `networkx` (or graph theory in general), is that complex pandas transformations can sometimes be greatly simplified – mozway Feb 16 '22 at 15:13
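
The distinction raised in these comments can be illustrated with a small made-up edge list. This is hypothetical data, not from the question: `R` is the only node without a parent, but `X` has more direct children. In the question's data every set contains all choose-2 pairs, so the two criteria coincide there. This sketch only shows that they can differ in general.

```python
import pandas as pd

# Hypothetical edges (not from the question): R -> X, and X -> Y, Z, W.
toy = pd.DataFrame({"ID1": ["R", "X", "X", "X"],
                    "ID2": ["X", "Y", "Z", "W"]})

# Root criterion: ID1 values that never appear as a target in ID2.
root = toy.loc[~toy["ID1"].isin(toy["ID2"]), "ID1"].unique().tolist()

# Count criterion: the ID1 value with the most rows (direct children).
most_children = toy["ID1"].value_counts().idxmax()
```

Here `root` is `['R']` while `most_children` is `'X'`, so which rows to keep depends on whether "most ID1's" means direct children only or all descendants.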