Find symmetric pairs quickly in numpy

Question

from itertools import product
import pandas as pd

df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
#     c1  c2
# 0    0   0
# 1    0   1
# 2    0   2
# 3    0   3
# 4    0   4
# ..  ..  ..
# 85   9   4
# 86   9   5
# 87   9   7
# 88   9   8
# 89   9   9
# 
# [90 rows x 2 columns]

How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?

For example, '(0, 1)' and '(1, 0)' form a symmetric pair. The latter should be removed.

The algorithm must be fast, so a NumPy-based solution is preferred. Converting to Python objects is not allowed.

Could you give an example of what you understand by `symmetric pairs`? — yatu, Oct 28 '19 at 14:21
@JerryM. Yes, but it is trivial to remove with `df.drop_duplicates()` — The Unfun Cat, Oct 28 '19 at 14:25
@molybdenum42 I use itertools product to create an example, the data themselves are not created with itertools product. — The Unfun Cat, Oct 28 '19 at 14:27
It is a general problem. The way I constructed the data is not relevant :) — The Unfun Cat, Oct 28 '19 at 14:28

Quang Hoang · Accepted Answer · 2019-10-28T14:35:03.660

You can sort the values within each row, then group by the sorted pair:

import numpy as np

a = np.sort(df.to_numpy(), axis=1)   # order each pair as (min, max)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()

Option 2: If there are many distinct (c1, c2) pairs, groupby can be slow. In that case, we can assign the sorted values as new columns and filter with drop_duplicates:

a = np.sort(df.to_numpy(), axis=1)

(df.assign(one=a[:,0], two=a[:,1])   # 'one' and 'two' are arbitrary helper column names
   .drop_duplicates(['one','two'])   # must match the names assigned above
   .reindex(df.columns, axis=1)      # restore the original columns
)
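As a minimal sketch of Option 2 on a small made-up frame (the data and the helper column names `lo` and `hi` are illustrative assumptions, not part of the answer):

```python
import numpy as np
import pandas as pd

# Hypothetical data: (1, 0) mirrors (0, 1) and (2, 0) mirrors (0, 2).
df = pd.DataFrame({"c1": [0, 0, 1, 2, 2], "c2": [1, 2, 0, 0, 2]})

# Sort within each row so that both orientations share one key.
a = np.sort(df.to_numpy(), axis=1)

out = (
    df.assign(lo=a[:, 0], hi=a[:, 1])     # helper columns holding the sorted pair
      .drop_duplicates(["lo", "hi"])      # keeps the first occurrence by default
      .reindex(df.columns, axis=1)        # drop the helper columns again
)
print(out)
```

Because `drop_duplicates` keeps the first occurrence by default, the later member of each symmetric pair is the one that gets dropped, and the original index labels are preserved.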

score 7 · Answer 2 · answered Oct 28 '19 at 14:32

One way is to use np.unique with return_index=True and index the dataframe with the result:

a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)

print(df.iloc[ix, :])

    c1  c2
0    0   0
1    0   1
20   2   0
3    0   3
40   4   0
50   5   0
6    0   6
70   7   0
8    0   8
9    0   9
11   1   1
21   2   1
13   1   3
41   4   1
51   5   1
16   1   6
71   7   1
...
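
Note that the indices returned by np.unique follow the sorted order of the unique rows rather than the original row order, which is why the index in the output above is not monotonic. If the original order matters, one possible adjustment is to sort the indices before indexing. A minimal sketch on made-up data (the frame below is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 are mirrors, as are rows 0 and 3.
df = pd.DataFrame({"c1": [2, 0, 1, 0, 2], "c2": [0, 1, 0, 2, 2]})

a = np.sort(df.to_numpy(), axis=1)               # order each pair as (min, max)
_, ix = np.unique(a, axis=0, return_index=True)  # first occurrence of each unique row

# ix is ordered by the sorted unique rows; sorting it restores row order.
kept = df.iloc[np.sort(ix)]
print(kept)
```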

score 6 · Answer 3 · answered Oct 28 '19 at 14:30

`frozenset`

mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()

df[~mask]

piRSquared

Aren't you iterating slowly over tuples over each column here? Still, upvote. – The Unfun Cat Oct 28 '19 at 14:31
Yes, I'm iterating. No, it isn't as slow as you think. – piRSquared Oct 28 '19 at 14:32

score 5 · Answer 4 · answered Oct 28 '19 at 14:31

I would do:

df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]

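Alternatively, using pandas crosstab with NumPy's triu:

s = pd.crosstab(df.c1, df.c2)
# np.bool was removed in NumPy 1.24; the builtin bool works across versions.
# Note: & binds tighter than ==, so the condition evaluates as (mask & s) == 0.
s = s.mask(np.triu(np.ones(s.shape)).astype(bool) & s==0).stack().reset_index()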

BENY

Divakar · Answer 5 · 2019-10-28T15:22:55.247

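Here's a NumPy-based approach for integers:

import numpy as np
import pandas as pd

def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)                          # order each pair as (min, max)
    idx = np.ravel_multi_index(b.T, b.max(0) + 1)   # one integer key per pair
    sidx = idx.argsort(kind='mergesort')            # stable sort: first occurrences stay first
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]                # True at the first row of each key
    a_out = a[np.sort(sidx[m])]                     # restore the original row order
    df_out = pd.DataFrame(a_out)
    return df_out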

If you want to keep the index data as it is, use `return df.iloc[np.sort(sidx[m])]`.
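
A minimal sketch of that index-preserving variant, assuming non-negative integer columns (the function name `remove_symm_pairs_keep_index` and the sample frame are made up for illustration):

```python
import numpy as np
import pandas as pd

def remove_symm_pairs_keep_index(df):
    """Drop later mirrors of symmetric pairs, keeping the original index labels."""
    a = df.to_numpy()
    b = np.sort(a, axis=1)                       # order each pair as (min, max)
    # Encode each sorted pair as a single integer key.
    # Assumption: all values are non-negative integers.
    idx = np.ravel_multi_index(tuple(b.T), tuple(b.max(axis=0) + 1))
    sidx = idx.argsort(kind="mergesort")         # stable, so first occurrences lead
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]             # True at the first row of each key
    return df.iloc[np.sort(sidx[m])]             # original order, original index

# Hypothetical data with a non-default index to show that labels survive.
df = pd.DataFrame(
    {"c1": [2, 0, 1, 0, 2], "c2": [0, 1, 0, 2, 2]},
    index=list("abcde"),
)
result = remove_symm_pairs_keep_index(df)
print(result)
```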

For generic numbers (ints, floats, etc.), we can use a view-based approach:

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

Then simply replace the step that computes idx with `idx = view1D(b)` in `remove_symm_pairs`.

score 2 · Answer 6 · answered Oct 29 '19 at 07:51

If this needs to be fast, and if your variables are integers, then the following trick may help: let v, w be the two columns of your data; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x+y), (x-y)]/2.
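
The trick above could be sketched as follows. This is one possible reading, not the answerer's own code: it uses np.unique with axis=0 for the lexicographic sort and deduplication, and the sample data is made up. Since (x+y)/2 is the larger value and (x-y)/2 the smaller, each pair comes back as (larger, smaller), which is not necessarily the orientation that appeared first in the input.

```python
import numpy as np
import pandas as pd

# Hypothetical integer data containing mirrored pairs.
df = pd.DataFrame({"c1": [2, 0, 1, 0, 2], "c2": [0, 1, 0, 2, 2]})
v = df["c1"].to_numpy()
w = df["c2"].to_numpy()

# Both orientations of a pair map to the same (sum, absolute difference).
x = v + w
y = np.abs(v - w)

# np.unique with axis=0 sorts the rows lexicographically and drops duplicates.
xy = np.unique(np.column_stack([x, y]), axis=0)

# Map back: x + y = 2 * max and x - y = 2 * min, so integer division is exact.
larger = (xy[:, 0] + xy[:, 1]) // 2
smaller = (xy[:, 0] - xy[:, 1]) // 2
out = pd.DataFrame({"c1": larger, "c2": smaller})
print(out)
```

Note that this variant returns a new frame sorted by (x, y) and does not keep the original index.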
