5

My question is similar to Pandas: remove reverse duplicates from dataframe but I have an additional requirement. I need to maintain row value pairs.

For example:

I have data where column A corresponds to column C and column B corresponds to column D.

import pandas as pd

# Initial data frame
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50], 
                     'B': [50, 22, 35, 5, 10, 11, 21, 0],
                     'C': ["a", "b", "r", "x", "c", "w", "z", "y"],
                     'D': ["y", "c", "w", "z", "b", "r", "x", "a"]})
data

#    A   B  C  D
#0   0  50  a  y
#1  10  22  b  c
#2  11  35  r  w
#3  21   5  x  z
#4  22  10  c  b
#5  35  11  w  r
#6   5  21  z  x
#7  50   0  y  a

I would like to remove duplicates that exist in columns A and B but I need to preserve their corresponding letter value in columns C and D.

I have a solution here but is there a more elegant way of doing this?

# Desired data frame
new_data = pd.DataFrame()

# Concat numbers and corresponding letters
new_data['AC'] = data['A'].astype(str) + ',' + data['C']
new_data['BD'] = data['B'].astype(str) + ',' + data['D']

# Drop duplicates despite order
new_data = new_data.apply(lambda r: sorted(r), axis = 1).drop_duplicates()

# Recreate dataframe
new_data = pd.DataFrame.from_items(zip(new_data.index, new_data.values)).T
new_data = pd.concat([new_data.iloc[:,0].str.split(',', expand=True),
                      new_data.iloc[:,1].str.split(',', expand=True)], axis=1)
new_data.columns=['A', 'B', 'C', 'D']
new_data

#    A  B   C  D
#0   0  a  50  y
#1  10  b  22  c
#2  11  r  35  w
#3  21  x   5  z

EDIT technically output should look like this:

new_data.columns=['A', 'C', 'B', 'D']
new_data

#    A  B   C  D
#0   0  a  50  y
#1  10  b  22  c
#2  11  r  35  w
#3  21  x   5  z
BeeGee
  • 815
  • 2
  • 17
  • 33

3 Answers3

8

I think that you can do this with stack, drop_duplicates and unstack:

data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()

    A   B  C  D
0   0  50  a  y
1  10  22  b  c
2  11  35  r  w
3  21   5  x  z
sacuL
  • 49,704
  • 8
  • 81
  • 106
  • 1
    nice, if you set `['A', 'B']` as the index, the reordering at the end is not necessary, and the same result is obtained as `A implies C`, and `B implies D`. – Haleemur Ali Jul 30 '18 at 23:42
  • Is this drop duplicate by column C, D (not A, B) right ? – BENY Jul 30 '18 at 23:51
  • This solution only works if there is one duplicate per row. It just fits my specific example and will not work otherwise – BeeGee Jul 31 '18 at 21:54
1

create two additional columns taking the data the sorted data in columns

columns = ['A', 'B']
df = pd.concat([data, pd.DataFrame(np.sort(data[columns], axis=1), axis=1)

drop duplicates using the sorted data & select the original columns

df.drop_duplicates(df.columns.difference(data.columns))[data.columns]
output:
    A   B  C  D
0   0  50  a  y
1  10  22  b  c
2  11  35  r  w
3  21   5  x  z
Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85
1

Base on the link you provided

newdf=data[['A','B']].apply(lambda r: sorted(r), axis = 1).drop_duplicates()
newdf['C']=newdf.A.map(dict(zip(data.A,data.C)))
newdf['D']=newdf.B.map(dict(zip(data.B,data.D)))
newdf
Out[138]: 
    A   B  C  D
0   0  50  a  y
1  10  22  b  c
2  11  35  r  w
3   5  21  z  x
BENY
  • 317,841
  • 20
  • 164
  • 234