
In pandas I would like to group by two columns and calculate how many third-column values are shared, with a preference for greater sharing.

In the dataframe below, group the col1 values, group the col2 values, and count how often col3 values are shared between col2 values.

The result is: ID1 and ID2 share a col3 value (count 2), and ID3 shares with none (count 1). However, ID1, ID2 and ID4 also share a value (count 3). Since ID1 and ID2 already share a value, take the value that is shared by both of those IDs and more, i.e. the count of 3. The answer is therefore 3, 1. The counts must always sum to the number of unique col2 values (here 3 + 1 = 4).

col1  col2  col3
A     ID1   15
A     ID1   16
A     ID1   12
A     ID2   15
A     ID2   12
A     ID3   18
A     ID4   19
A     ID4   12
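For reference, the table above can be built inline as a DataFrame (a minimal sketch; the values are copied from the table, and the variable name `df` is just illustrative):

```python
import pandas as pd

# The example table from the question as a DataFrame
df = pd.DataFrame({
    'col1': ['A'] * 8,
    'col2': ['ID1', 'ID1', 'ID1', 'ID2', 'ID2', 'ID3', 'ID4', 'ID4'],
    'col3': [15, 16, 12, 15, 12, 18, 19, 12],
})
```

The desired output for group A is the count list 3, 1, which sums to the number of unique col2 values (4).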
    Can you include what your actual desired result looks like? Not just an explanation of intermediate results – Michael Szczesny Oct 02 '21 at 18:49
  • @SangeerththanBalachandran - I rejected your edit, because you can read this table into a dataframe with `pd.read_html('https://stackoverflow.com/q/69419264/14277722')[0]`. Your edit makes it harder. – Michael Szczesny Oct 02 '21 at 19:21
  • Thanks for your feedback! The answer would simply be (for A): A: 3,1. It could be a dict or a df where the second column is a list, it doesn't matter. In reality there would be a B, C etc. in column one, so further results – mattmoore_bioinfo Oct 02 '21 at 19:49

1 Answer


If I am understanding you correctly, I think you want to group by col3 instead of col2:

import pandas as pd

df = pd.read_html('https://stackoverflow.com/q/69419264/14277722')[0]

df = df.groupby(['col1','col3'])['col2'].apply(list).reset_index()
df['count'] = df['col2'].apply(len)

You can then remove rows where col2 is a subset of another row with the following:

import numpy as np

# One-hot membership matrix: one row per (col1, col3) group, one column per ID.
# `.max(level=0)` is deprecated in recent pandas; `.groupby(level=0).max()` is equivalent.
arr = pd.get_dummies(df['col2'].explode()).groupby(level=0).max().to_numpy()
# Pairwise overlap sizes; group j is a subset of group i when their overlap equals group j's size
subsets = np.matmul(arr, arr.T)
np.fill_diagonal(subsets, 0)
mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)

df = df[mask]
   col1 col3             col2  count
0     A   12  [ID1, ID2, ID4]      3
3     A   18            [ID3]      1     
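Since `pd.read_html` needs network access, the whole pipeline can also be run offline by building the frame inline; a self-contained sketch (the inline data is copied from the question's table, and `.groupby(level=0).max()` replaces the now-deprecated `.max(level=0)`):

```python
import numpy as np
import pandas as pd

# Same data as the question's table, built inline
df = pd.DataFrame({
    'col1': ['A'] * 8,
    'col2': ['ID1', 'ID1', 'ID1', 'ID2', 'ID2', 'ID3', 'ID4', 'ID4'],
    'col3': [15, 16, 12, 15, 12, 18, 19, 12],
})

# Group by the shared value and collect the IDs carrying it
grouped = df.groupby(['col1', 'col3'])['col2'].apply(list).reset_index()
grouped['count'] = grouped['col2'].apply(len)

# One-hot membership matrix: one row per (col1, col3) group, one column per ID
# (cast to int so the matrix product counts overlaps rather than OR-ing booleans)
arr = pd.get_dummies(grouped['col2'].explode()).groupby(level=0).max().to_numpy().astype(int)

# Pairwise overlap sizes; group j is a subset of group i when their
# overlap equals group j's own size, so drop any such j
overlap = arr @ arr.T
np.fill_diagonal(overlap, 0)
mask = ~np.equal(overlap, arr.sum(1)).any(0)

result = grouped[mask]
print(result)
```

On the question's data this keeps only the maximal groups: col3 value 12 shared by ID1, ID2 and ID4 (count 3) and col3 value 18 held by ID3 alone (count 1).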
iacob
  • Duplicate lists in col2 should first be dropped: `df['col2_str'] = df['col2'].astype(str)` (convert to a hashable type), then `df = df.groupby('col2_str').apply(lambda x: x.sample(1)).reset_index(drop=True)` to drop the duplicates, and finally `df = df.drop(['col2_str'], axis=1)` – mattmoore_bioinfo Nov 25 '21 at 15:17
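The deduplication step from the comment can be written more directly with `drop_duplicates` instead of the groupby/sample trick; a sketch on hypothetical data (the frame and column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical frame where two groups ended up with the same ID list in col2
df = pd.DataFrame({'col3': [12, 15], 'col2': [['ID1', 'ID2'], ['ID1', 'ID2']]})

# Lists are unhashable, so serialise them to strings first,
# then keep one row per distinct list and drop the helper column
df['col2_str'] = df['col2'].astype(str)
df = df.drop_duplicates('col2_str').drop(columns='col2_str').reset_index(drop=True)
```

`drop_duplicates` keeps the first row of each distinct list deterministically, whereas `sample(1)` picks an arbitrary one.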