15
from itertools import product
import pandas as pd

df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
#     c1  c2
# 0    0   0
# 1    0   1
# 2    0   2
# 3    0   3
# 4    0   4
# ..  ..  ..
# 85   9   4
# 86   9   5
# 87   9   7
# 88   9   8
# 89   9   9
# 
# [90 rows x 2 columns]

How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?

An example of symmetric pair is that '(0, 1)' is equal to '(1, 0)'. The latter should be removed.

The algorithm must be fast, so it is recommended to use numpy. Converting to python object is not allowed.

SK98
  • 25
  • 6
The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156

6 Answers6

13

You can sort the values, then groupby:

a= np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()

Option 2: If you have a lot of pairs c1, c2, groupby can be slow. In that case, we can assign new values and filter by drop_duplicates:

a= np.sort(df.to_numpy(), axis=1) 

(df.assign(one=a[:,0], two=a[:,1])   # one and two can be changed
   .drop_duplicates(['one','two'])   # taken from above
   .reindex(df.columns, axis=1)
)
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
7

One way is using np.unique with return_index=True and use the result to index the dataframe:

a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)

print(df.iloc[ix, :])

    c1  c2
0    0   0
1    0   1
20   2   0
3    0   3
40   4   0
50   5   0
6    0   6
70   7   0
8    0   8
9    0   9
11   1   1
21   2   1
13   1   3
41   4   1
51   5   1
16   1   6
71   7   1
...
yatu
  • 86,083
  • 12
  • 84
  • 139
6

frozenset

mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()

df[~mask]
piRSquared
  • 285,575
  • 57
  • 475
  • 624
5

I will do

df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]

From pandas and numpy tri

s=pd.crosstab(df.c1,df.c2)
s=s.mask(np.triu(np.ones(s.shape)).astype(np.bool) & s==0).stack().reset_index()
BENY
  • 317,841
  • 20
  • 164
  • 234
5

Here's one NumPy based one for integers -

def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a,axis=1)
    idx = np.ravel_multi_index(b.T,(b.max(0)+1))
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True,p[:-1]!=p[1:]]
    a_out = a[np.sort(sidx[m])]
    df_out = pd.DataFrame(a_out)
    return df_out

If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].

For generic numbers (ints/floats, etc.), we will use a view-based one -

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.

Divakar
  • 218,885
  • 19
  • 262
  • 358
2

If this needs to be fast, and if your variables are integer, then the following trick may help: let v,w be the columns of your vector; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x+y), (x-y)]/2.

Federico Poloni
  • 668
  • 8
  • 24