
I have a dataframe like this:

source   target   weight
     1       2         5
     2       1         5
     1       2         5
     1       2         7
     3       1         6
     1       1         6
     1       3         6
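
For reference, this frame can be reproduced with something like the following (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'source': [1, 2, 1, 1, 3, 1, 1],
                   'target': [2, 1, 2, 2, 1, 1, 3],
                   'weight': [5, 5, 5, 7, 6, 6, 6]})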

My goal is to remove the duplicate rows, but the order of the source and target columns is not important. In other words, a row whose source and target hold the same pair of values in the reverse order counts as a duplicate and should be removed. The weight still matters: rows with different weights are not duplicates. In this case, the expected result would be

source   target   weight
     1       2         5
     1       2         7
     3       1         6
     1       1         6

Is there any way to do this without loops?

Elham
  • See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html and https://stackoverflow.com/a/34272155/1265980 – Ereli Jun 21 '18 at 18:59
  • @VenkataGogu it is not a duplicate of that question. Try it: `df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 1, 1], 'c': [1, 3, 2]})` and `df = df.drop_duplicates(subset=['a', 'b'], keep=False)`. All 3 rows still exist. The question title is specific and the data is clear; the OP wants to drop duplicates where values can appear across a subset of the columns, but which column they appear in is not important. – roganjosh Jun 21 '18 at 19:14
  • And actually I can't think of an elegant way to do it without it spiraling out of control as the number of columns increases. Nice question. – roganjosh Jun 21 '18 at 19:19
  • Actually, the value of weight (third column) is important. – Elham Jun 21 '18 at 19:55
  • You just updated your expected outcome, can you explain what changed? – user3483203 Jun 21 '18 at 19:57
  • In this case, another row has been added just to show that if the weight is different, the rows should not be deleted. – Elham Jun 21 '18 at 20:01

2 Answers


Use frozenset and duplicated:

df[~df[['source', 'target']].apply(frozenset, axis=1).duplicated()]

   source  target  weight
0       1       2       5
4       3       1       6
5       1       1       6
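
This works because frozenset ignores element order and is hashable, so a reversed (source, target) pair compares and hashes equal, which is exactly what duplicated needs. A quick illustration (assuming the usual import pandas as pd):

frozenset((1, 2)) == frozenset((2, 1))                          # True
pd.Series([frozenset((1, 2)), frozenset((2, 1))]).duplicated()
# 0    False
# 1     True
# dtype: bool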

If you want to account for unordered source/target as well as weight:

df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, axis=1)).duplicated()]

   source  target  weight
0       1       2       5
3       1       2       7
4       3       1       6
5       1       1       6

However, here is a more explicit and more readable version:

# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')

# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()

df[~mask]

   source  target  weight
0       1       2       5
3       1       2       7
4       3       1       6
5       1       1       6
piRSquared

Should be fairly easy.

import pandas as pd

data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [1, 2, 7],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])

You can drop the exact duplicates using drop_duplicates:

df = df.drop_duplicates(keep=False)
print(df)

would result in:

   source  target  weight
1       2       1       5
3       1       2       7
4       3       1       6
5       1       1       6
6       1       3       6
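
A note on the keep argument of drop_duplicates: the default keep='first' keeps one copy of each exact duplicate, while keep=False (used above) drops every copy. That is why both identical (1, 2, 5) rows disappear here; the reversed (2, 1, 5) row survives and is handled in the next step. For comparison, on the original frame:

df.drop_duplicates()            # keeps the first (1, 2, 5) row
df.drop_duplicates(keep=False)  # drops both identical (1, 2, 5) rows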

Because you want to handle the unordered source/target issue, normalize each pair so that the smaller value always comes first:

def pair(row):
    # Sort each (source, target) pair so the smaller value ends up in 'source'.
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row

df = df.apply(pair, axis=1)

Then you can use df.drop_duplicates(), which gives:

   source  target  weight
1       1       2       5
3       1       2       7
4       1       3       6
5       1       1       6
Ereli
  • rows 3 and 5 are still duplicates – Venkata Gogu Jun 21 '18 at 19:41
  • ... I even gave comments under the question on why it wasn't as simple as you're making out. Your output doesn't even match the OP's expected output. – roganjosh Jun 21 '18 at 19:49
  • I've removed my downvote, but this does now default to Python speed – roganjosh Jun 21 '18 at 20:14
  • That's true, any solution that uses native Python types instead of numpy types and operators will result in normal CPython execution speed. This is also true for piRSquared's solution. – Ereli Jun 21 '18 at 20:41
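
Following up on the speed point in the last comment: a vectorized sketch (not taken from either answer, with hypothetical helper columns lo/hi) is to normalize each source/target pair with numpy's np.sort along axis 1 and then reuse duplicated on the normalized pair plus weight:

import numpy as np
import pandas as pd

df = pd.DataFrame({'source': [1, 2, 1, 1, 3, 1, 1],
                   'target': [2, 1, 2, 2, 1, 1, 3],
                   'weight': [5, 5, 5, 7, 6, 6, 6]})

# Sort each (source, target) pair so the smaller value comes first,
# staying inside vectorized numpy instead of building per-row Python objects.
pair = np.sort(df[['source', 'target']].to_numpy(), axis=1)

# Mark duplicates on (normalized pair, weight) and keep the original rows.
key = pd.DataFrame({'lo': pair[:, 0], 'hi': pair[:, 1], 'weight': df['weight']})
print(df[~key.duplicated()])
#    source  target  weight
# 0       1       2       5
# 3       1       2       7
# 4       3       1       6
# 5       1       1       6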