-1

I have the following dataframe:

import pandas as pd

data = {
    'person1_name': ['John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne', 'Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne'],
    'family1_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne'],
    'person2_name': ['Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne', 'John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne'],
    'family2_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne']
}

df = pd.DataFrame(data)

     person1_name family1_name      person2_name family2_name
 John_Ethan_Wayne        Wayne     Michael_Wayne        Wayne
 John_Ethan_Wayne        Wayne     Patrick_Wayne        Wayne
    Michael_Wayne        Wayne     Patrick_Wayne        Wayne
    Michael_Wayne        Wayne  John_Ethan_Wayne        Wayne
    Patrick_Wayne        Wayne  John_Ethan_Wayne        Wayne
    Patrick_Wayne        Wayne     Michael_Wayne        Wayne

I want to drop duplicate relations between (person1_name, family1_name) and (person2_name, family2_name), ignoring the direction of the relation, so that A -> B and B -> A count as the same row.

The final result should be:

     person1_name family1_name      person2_name family2_name
 John_Ethan_Wayne        Wayne     Michael_Wayne        Wayne
    Michael_Wayne        Wayne     Patrick_Wayne        Wayne
    Patrick_Wayne        Wayne  John_Ethan_Wayne        Wayne
Night Walker

3 Answers

0

In the example you gave, where every pair appears in both directions, the following is sufficient:

df[df.person1_name < df.person2_name]

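This works because, given the two rows:

A, B

B, A

the filter keeps A, B (A < B evaluates to True) and removes B, A (B < A evaluates to False). Note that a pair appearing only in the reversed order (B, A without A, B) would be dropped entirely.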
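If a pair might appear in one direction only, a sorted-pair key is one possible alternative to the comparison filter. The sketch below is a minimal illustration on made-up data (the names A, B, C and the variables `filtered`, `key`, and `deduped` are assumptions for this example, not part of the question):

```python
import pandas as pd

# Hypothetical data: (A, B) appears in both directions, (C, A) in one only.
df = pd.DataFrame({
    "person1_name": ["A", "B", "C"],
    "person2_name": ["B", "A", "A"],
})

# The plain comparison filter drops (C, A) because "C" < "A" is False.
filtered = df[df.person1_name < df.person2_name]

# Order-independent key: the two names of each row, sorted into a tuple.
key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df.person1_name, df.person2_name)],
    index=df.index,
)

# Keep the first row seen for each unordered pair.
deduped = df[~key.duplicated()]
print(deduped)
```

Here `filtered` keeps only the (A, B) row, while `deduped` keeps (A, B) and (C, A). To take the family names into account as well, the key could be built from (name, family) tuples instead of bare names.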

Mark
0
import pandas as pd

data = {
    'person1_name': ['John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne', 'Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne'],
    'family1_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne'],
    'person2_name': ['Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne', 'John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne'],
    'family2_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne']
}

df = pd.DataFrame(data)

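# Build an order-independent key: a frozenset of the two (name, family) pairs,
# so that (A, B) and (B, A) produce the same value.
df['combined'] = df.apply(lambda row: frozenset({(row['person1_name'], row['family1_name']), (row['person2_name'], row['family2_name'])}), axis=1)

# Keep the first row seen for each unordered pair.
df = df.drop_duplicates(subset=['combined'])

# Sort for a stable order, reset the index, and drop the helper column.
df = df.sort_values(by=['person1_name', 'person2_name'])
df = df.reset_index(drop=True)
df = df.drop(columns='combined')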

print(df)
Amira Bedhiafi
0
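# Sort the four values of each row into a tuple; tuples are hashable, whereas
# the lists returned by a bare `sorted` make drop_duplicates raise TypeError.
df['combined_names'] = df[['person1_name', 'family1_name', 'person2_name', 'family2_name']].apply(lambda row: tuple(sorted(row)), axis=1)

unique_combinations = df.drop_duplicates(subset='combined_names').drop(columns='combined_names')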
Night Walker