Pandas: remove duplicates that exist in any order

Question

My question is similar to Pandas: remove reverse duplicates from dataframe but I have an additional requirement. I need to maintain row value pairs.

For example:

I have data where column A corresponds to column C and column B corresponds to column D.

import pandas as pd

# Initial data frame
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50], 
                     'B': [50, 22, 35, 5, 10, 11, 21, 0],
                     'C': ["a", "b", "r", "x", "c", "w", "z", "y"],
                     'D': ["y", "c", "w", "z", "b", "r", "x", "a"]})
data

#    A   B  C  D
#0   0  50  a  y
#1  10  22  b  c
#2  11  35  r  w
#3  21   5  x  z
#4  22  10  c  b
#5  35  11  w  r
#6   5  21  z  x
#7  50   0  y  a

I would like to remove duplicates that exist in columns A and B but I need to preserve their corresponding letter value in columns C and D.

I have a solution here but is there a more elegant way of doing this?

# Desired data frame
new_data = pd.DataFrame()

# Concat numbers and corresponding letters
new_data['AC'] = data['A'].astype(str) + ',' + data['C']
new_data['BD'] = data['B'].astype(str) + ',' + data['D']

# Drop duplicates despite order
new_data = new_data.apply(lambda r: sorted(r), axis = 1).drop_duplicates()

# Recreate dataframe
new_data = pd.DataFrame.from_items(zip(new_data.index, new_data.values)).T
new_data = pd.concat([new_data.iloc[:,0].str.split(',', expand=True),
                      new_data.iloc[:,1].str.split(',', expand=True)], axis=1)
new_data.columns=['A', 'B', 'C', 'D']
new_data

#    A  B   C  D
#0   0  a  50  y
#1  10  b  22  c
#2  11  r  35  w
#3  21  x   5  z

EDIT technically output should look like this:

new_data.columns=['A', 'C', 'B', 'D']
new_data

#    A  B   C  D
#0   0  a  50  y
#1  10  b  22  c
#2  11  r  35  w
#3  21  x   5  z

It doesn't really matter but it is my mistake and I will fix it now — BeeGee, Jul 30 '18 at 23:38

sacuL · Accepted Answer · 2018-07-30T23:44:31.557

8

I think that you can do this with stack, drop_duplicates and unstack:

data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()

    A   B  C  D
0   0  50  a  y
1  10  22  b  c
2  11  35  r  w
3  21   5  x  z

edited Jul 30 '18 at 23:44

answered Jul 30 '18 at 23:35

sacuL

49,704
8
81
106

1

nice, if you set `['A', 'B']` as the index, the reordering at the end is not necessary, and the same result is obtained as `A implies C`, and `B implies D`. – Haleemur Ali Jul 30 '18 at 23:42
Is this drop duplicate by column C, D (not A, B) right ? – BENY Jul 30 '18 at 23:51
This solution only works if there is one duplicate per row. It just fits my specific example and will not work otherwise – BeeGee Jul 31 '18 at 21:54

score 1 · Answer 2 · answered Jul 30 '18 at 23:37

create two additional columns taking the data the sorted data in columns

columns = ['A', 'B']
df = pd.concat([data, pd.DataFrame(np.sort(data[columns], axis=1), axis=1)

drop duplicates using the sorted data & select the original columns

df.drop_duplicates(df.columns.difference(data.columns))[data.columns]
output:
    A   B  C  D
0   0  50  a  y
1  10  22  b  c
2  11  35  r  w
3  21   5  x  z

score 1 · Answer 3 · answered Jul 30 '18 at 23:44

Base on the link you provided

newdf=data[['A','B']].apply(lambda r: sorted(r), axis = 1).drop_duplicates()
newdf['C']=newdf.A.map(dict(zip(data.A,data.C)))
newdf['D']=newdf.B.map(dict(zip(data.B,data.D)))
newdf
Out[138]: 
    A   B  C  D
0   0  50  a  y
1  10  22  b  c
2  11  35  r  w
3   5  21  z  x

Pandas: remove duplicates that exist in any order

3 Answers3

Linked

Related