identification of duplicates in pandas dataframe

Question

I have the following dataframe 'matches_df', using python 2.7:

name | opponent | date     | win
'Bob'    'Bill'   7/12/16     Y
'Mike'   'Matt'   4/15/18     N
'Tim'    'Tom'    1/1/11      N
'Bill'   'Bob'    7/12/16     N

I want a list not containing duplicate games. These are games that have the same two players (not necessarily in the same columns) and took place on the same date. In the above example, games 1 and 4 are duplicates.

In order to address this, I tried to create a 4th column, game_id, that creates a sorted combination of the first 3 columns. Eg I wanted this result:

name | opponent | date    | win | game_id
'Bob'    'Bill'   7/12/16    Y   '7/12/16 Bill Bob'  
'Mike'   'Matt'   4/15/18    N   '4/15/18 Matt Mike'
'Tim'    'Tom'    1/1/11     N   '1/1/11 Tim Tom'
'Bill'   'Bob'    7/12/16    N   '7/12/16 Bill Bob'

I used the following code:

def sort_and_squash(a,b,c):
    return ''.join(sorted([str(a),str(b),str(c)]))
matches_df = matches_df.assign(game_id = lambda x: sort_and_squash(x.name,x.opponent,x.date))

However, this did not work as intended, creating a blank column in the data frame.

I am looking for help either in finding the error in my code for the intermediate step or recommending an alternative approach.

Quite overlapping with this answer: http://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas. Also interesting if you want to remove duplicates only regarding some few columns. — PlagTag, Jan 02 '17 at 09:32

piRSquared · Accepted Answer · 2016-12-29T19:23:27.023

Sort the players in columns ['name', 'opponent']
If they are the same but in different positions, sorting will put them in the same position.

game_id_df = df[['date']].join(df[['name', 'opponent']].apply(sorted, 1))
print(game_id_df)

      date  name opponent
0  7/12/16  Bill      Bob
1  4/15/18  Matt     Mike
2   1/1/11   Tim      Tom
3  7/12/16  Bill      Bob

Then join strings and add column

df['game_id'] = game_id_df.apply(tuple, 1).str.join(' ')
print(df)

   name opponent     date win            game_id
0   Bob     Bill  7/12/16   Y   7/12/16 Bill Bob
1  Mike     Matt  4/15/18   N  4/15/18 Matt Mike
2   Tim      Tom   1/1/11   N     1/1/11 Tim Tom
3  Bill      Bob  7/12/16   N   7/12/16 Bill Bob

Now you can use game_id to drop duplicates

print(df.drop_duplicates(subset=['game_id']))

   name opponent     date win            game_id
0   Bob     Bill  7/12/16   Y   7/12/16 Bill Bob
1  Mike     Matt  4/15/18   N  4/15/18 Matt Mike
2   Tim      Tom   1/1/11   N     1/1/11 Tim Tom

score 1 · Answer 2 · answered Dec 29 '16 at 20:07

Although this has already been answered by piRSquared, if you'd rather have something closer to your original approach (or you want to understand why your original approach did not work), you can try this.

def sort_and_squash(df):
    return [' '.join(sorted([d.strftime('%m/%d/%Y'), n, o]))
            for d, n, o in zip(df.date, df.name, df.opponent)]

matches_df = matches_df.assign(game_id=sort_and_squash)

The function that is passed to the assign method expects as an argument the dataframe and is expected to return the whole new column. You need something like the list comprehension above, to make that work.

>>> print matches_df
        date  name opponent win               game_id
0 2016-07-12   Bob     Bill   Y   07/12/2016 Bill Bob
1 2018-04-15  Mike     Matt   N  04/15/2018 Matt Mike
2 2011-01-01   Tim      Tom   N    01/01/2011 Tim Tom
3 2016-07-12  Bill      Bob   N   07/12/2016 Bill Bob

Of course, this just inserts the game_id column; it does not eliminate duplicates. To eliminate them, you also need:

matches_df = matches_df.drop_duplicates(subset=['game_id'])

and then:

>>> print matches_df
        date  name opponent win               game_id
0 2016-07-12   Bob     Bill   Y   07/12/2016 Bill Bob
1 2018-04-15  Mike     Matt   N  04/15/2018 Matt Mike
2 2011-01-01   Tim      Tom   N    01/01/2011 Tim Tom

As last touch, if you don't need the game_id column any more, you can drop it with:

matches_df = matches_df.drop('game_id', 1)

which gives you:

>>> print matches_df
        date  name opponent win
0 2016-07-12   Bob     Bill   Y
1 2018-04-15  Mike     Matt   N
2 2011-01-01   Tim      Tom   N

identification of duplicates in pandas dataframe

2 Answers2