Removing rows with redundant data

Question

I am working with pair of texts. Some data is redundant for different columns of the data frame. For example, the screenshot of data frame is following:

COLUMN_A	COLUMN_B
a	x
b	y
x	a

Here row 0 has (a,x) and row 2 has (x,a), which is redundant in my case and has to be deleted. I am building a huge dataset for comparing the semantic similarity between two pairs of texts. At the moment, I want to compare each row with all of the other rows of the same data frame to remove duplicates. How can I compare both columns of each row with all of the other rows?

This is row level duplicates regardless of column order? Or cell level duplicates? Like would row 0 `x a` and row 1 `x x` be considered duplicates since `x` already appears in the Frame or no only `a x` `x a` because both `a` and `x` appear in the row? — Henry Ecker, Jun 20 '21 at 02:12
only row0 "a-x" and and row 2"x-a" are considered as duplicates — Anna Martin, Jun 20 '21 at 22:34

score 0 · Answer 1 · answered Jun 20 '21 at 02:51

0

Try this:

col_list = ['COULMN_A','COULMN_B']
df[col_list] = np.sort(df[col_list].to_numpy(),axis=1)
df = df.drop_duplicates(subset = col_list)

answered Jun 20 '21 at 02:51

rhug123

7,893
1
9
24

score 0 · Answer 2 · answered Jun 20 '21 at 04:59

You can use an auxiliary column to concatenate the text in the other columns and then drop everything you need.

Like this:

df['aux'] = df['COULMN_A'].str.cat(df['COULMN_B'])
df['aux'] = list(map(lambda x : ''.join(sorted(x)), df['aux']))
df.drop_duplicates(subset=['aux'], inplace=True)
df.drop(columns=['aux'],inplace=True)

Removing rows with redundant data

2 Answers2