0

I am working with pair of texts. Some data is redundant for different columns of the data frame. For example, the screenshot of data frame is following: img

COLUMN_A COLUMN_B
a x
b y
x a

Here row 0 has (a,x) and row 2 has (x,a), which is redundant in my case and has to be deleted. I am building a huge dataset for comparing the semantic similarity between two pairs of texts. At the moment, I want to compare each row with all of the other rows of the same data frame to remove duplicates. How can I compare both columns of each row with all of the other rows?

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • 1
    This is row level duplicates regardless of column order? Or cell level duplicates? Like would row 0 `x a` and row 1 `x x` be considered duplicates since `x` already appears in the Frame or no only `a x` `x a` because both `a` and `x` appear in the row? – Henry Ecker Jun 20 '21 at 02:12
  • only row0 "a-x" and and row 2"x-a" are considered as duplicates – Anna Martin Jun 20 '21 at 22:34

2 Answers2

0

Try this:

col_list = ['COULMN_A','COULMN_B']
df[col_list] = np.sort(df[col_list].to_numpy(),axis=1)
df = df.drop_duplicates(subset = col_list)
rhug123
  • 7,893
  • 1
  • 9
  • 24
0

You can use an auxiliary column to concatenate the text in the other columns and then drop everything you need.

Like this:

df['aux'] = df['COULMN_A'].str.cat(df['COULMN_B'])
df['aux'] = list(map(lambda x : ''.join(sorted(x)), df['aux']))
df.drop_duplicates(subset=['aux'], inplace=True)
df.drop(columns=['aux'],inplace=True)