5

I'm looking for a simple way in pandas to count all pairwise combination between 2 columns. For example:

df

Col1    Col2   Freq
 A        B     1
 B        A     1
 A        C     1
 C        B     1
 C        A     1

Expected output:

Col1    Col2   Freq
 A        B     2
 A        C     2
 C        B     1

thanks

user11666514
  • 165
  • 1
  • 8
  • 2
    `df[['Col1', 'Col2']] = np.sort(df[['Col1', 'Col2']], axis=1); df.groupby(['Col1', 'Col2']).size()` – user3483203 Nov 20 '19 at 19:10
  • It's `sum` instead of `size`: `df.groupby(['Col1', 'Col2']).sum().reset_index()`. Post it as an answer @user3483203 ? – Erfan Nov 20 '19 at 19:14
  • It's a clear dup of like 100 other posts. It's more a question of which one is the best to link (i.e. the one that clearly avoids a slow solution) – ALollz Nov 20 '19 at 19:16
  • [This question](https://stackoverflow.com/questions/58592606/find-symmetric-pairs-quickly-in-numpy/58592764#58592764) replaces `first()` with `sum()`. – Quang Hoang Nov 20 '19 at 19:23
  • I think the frozenset answer should be undeleted, it really was a good answer – oppressionslayer Nov 20 '19 at 20:00

3 Answers3

1

np.sort and groupby

The main trick is to sort each row when considering just the columns 'Col1' and 'Col2'. I use some unpacking trickery to get the first column of the sorted thing into x and the second into y.

I could've placed these into there own columns but that isn't necessary as groupby allows to group by external arrays. So I do exactly that.

x, y = np.sort(df[['Col1', 'Col2']].to_numpy(), axis=1).T
df.Freq.groupby([x, y]).sum()

A  B    2
   C    2
B  C    1
Name: Freq, dtype: int64
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Nice solution that seems to work, but can you explain it? The "groupby([x, y])" in particular really eludes me... – Tickon Feb 10 '20 at 15:20
1

You can try:

df2=(df
  .groupby(
       df[["Col1", "Col2"]]
       .apply(lambda x: str(sorted(x)), axis=1)
       )
  .agg({"Col1": "first", "Col2": "first", "Freq": "sum"})
  .reset_index(drop=True)
)

Output:

  Col1 Col2  Freq
0    A    B     2
1    A    C     2
2    C    B     1
Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34
0

While this is clearly a dup of others, I'm not sure if the following approach is mentioned somewhere:

s = (df.groupby(['Col1','Col2'])
       .Freq.sum()
       .unstack(-1, fill_value=0)
    )

np.triu(s) + np.triu(s.T,1)

Output:

array([[0, 2, 2],
       [0, 0, 1],
       [0, 0, 0]], dtype=int64)
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74