
Given a Pandas DataFrame as follows:

  value_1 value_2
0       b       a
1       a       b
2       c       d
3       d       c
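
For reference, a minimal sketch that rebuilds this sample frame (the column names and values are taken from the table above):

import pandas as pd

df = pd.DataFrame({'value_1': ['b', 'a', 'c', 'd'],
                   'value_2': ['a', 'b', 'd', 'c']})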

I'd like to count how many times each pair occurred, regardless of order; in other words, (a, b) is equal to (b, a). I found some similar cases, but in those, values are expected to appear in only one column.

First, I tried something like:

import numpy as np

pd.crosstab(df.value_1, df.value_2).replace(0, np.nan).\
    stack().reset_index().rename(columns={0: 'Frequency'})

This approach still did not let me drop the duplicates that differ only in order (i.e. treat (a, b) and (b, a) as the same pair).

Then I treated the data as a list of tuples like [(a, b), (b, a), (c, d), (d, c)] and wrote a function that returns only the unique pairs, regardless of element order:

from collections import Counter
from itertools import combinations

def get_unique_pairs(pairs):
    unique_pairs = []
    # compare every combination of two tuples; they match when they
    # contain the same elements, e.g. ('a', 'b') and ('b', 'a')
    for pair in combinations(pairs, 2):
        if Counter(pair[0]) == Counter(pair[1]):
            unordered_pair = set(pair[0] + pair[1])
            unique_pairs.append(unordered_pair)
    # convert each set back to a tuple so duplicates collapse in the returned set
    return {tuple(e) for e in unique_pairs}

Its output for [(a, b), (b, a), (c, d), (d, c)] is {(a, b), (c, d)}.
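
To connect it back to the DataFrame, the call would look roughly like this; itertuples(index=False, name=None) just turns each row into a plain tuple (only a sketch of how the frame could be fed into the function):

pairs = list(df.itertuples(index=False, name=None))
# pairs == [('b', 'a'), ('a', 'b'), ('c', 'd'), ('d', 'c')]
print(get_unique_pairs(pairs))
# prints the two unordered pairs; the element order inside each tuple may vary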

But this is not viable for a big DataFrame, since it compares every row against every other row.

Finally, for the DataFrame described above, the expected output is:

   pair  frequency
0  b; a          2
1  c; d          2
Kfcaio

1 Answer


First, sort the values per row with numpy.sort:

# sort each row so that e.g. ('b', 'a') and ('a', 'b') become the same pair
df1 = pd.DataFrame(np.sort(df.values, axis=1))
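
For the sample data the intermediate frame should look like this (note that np.sort drops the original column names, so the new columns are just 0 and 1):

print (df1)
   0  1
0  a  b
1  a  b
2  c  d
3  c  d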

Then join the columns together and use Series.value_counts:

df = (df1[0] + '; ' + df1[1]).value_counts().rename_axis('pair').reset_index(name='count')
print (df)
   pair  count
0  a; b      2
1  c; d      2

Alternative with GroupBy.size:

s = (df1[0] + '; ' + df1[1])
df = s.groupby(s).size().rename_axis('pair').reset_index(name='count')
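
For the sample data this should give the same two rows, ordered by the pair label instead of by frequency:

print (df)
   pair  count
0  a; b      2
1  c; d      2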
jezrael