0

Given a dataframe

data = {
        "col1": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], 
        "col2": ['B', 'C', 'A', 'C', 'D', 'B', 'E', 'A',],
        "count": [3, 7, 12, 2, 8, 2, 5, 9]
        }
df = pd.DataFrame(data=data)

col1 col2 count
 A    B     3
 A    C     7
 B    A     12
 B    C     2
 C    D     8
 C    B     2
 D    E     5
 D    A     9

I want to aggregate all the rows that have symmetric pairs while maintaining the rows that don't. Therefore, the output would be a new dataframe

col1 col2 count
 A    B     15
 A    C     7
 B    C     4
 C    D     8
 D    E     5
 D    A     9

I have looked at Aggregate symmetric pairs pandas and Find symmetric pairs quickly in numpy but neither of them particularly help me.

DrakeMurdoch
  • 765
  • 11
  • 26

2 Answers2

3

Actually, the given linked used apply, which generally is slow. Here's a solution with np.sort:

s = np.sort(df[['col1','col2']], axis=1)
df.groupby([s[:,0],s[:,1]])['count'].sum().reset_index()

Output:

  level_0 level_1  count
0       A       B     15
1       A       C      7
2       A       D      9
3       B       C      4
4       C       D      8
5       D       E      5
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
2

Some small modifications to the answer you linked does what you want:

df1 = df[['col1', 'col2']].apply(sorted, 1, result_type = "expand").rename(columns = {0:'col1', 1:'col2'})

df.groupby([df1.col1, df1.col2]).sum().reset_index()

  col1 col2  count
0    A    B     15
1    A    C      7
2    A    D      9
3    B    C      4
4    C    D      8
5    D    E      5
sacuL
  • 49,704
  • 8
  • 81
  • 106