I have the following dataframe 'matches_df', using python 2.7:
name | opponent | date | win
'Bob' 'Bill' 7/12/16 Y
'Mike' 'Matt' 4/15/18 N
'Tim' 'Tom' 1/1/11 N
'Bill' 'Bob' 7/12/16 N
I want a list not containing duplicate games. These are games that have the same two players (not necessarily in the same columns) and took place on the same date. In the above example, games 1 and 4 are duplicates.
In order to address this, I tried to create a 4th column, game_id, that creates a sorted combination of the first 3 columns. Eg I wanted this result:
name | opponent | date | win | game_id
'Bob' 'Bill' 7/12/16 Y '7/12/16 Bill Bob'
'Mike' 'Matt' 4/15/18 N '4/15/18 Matt Mike'
'Tim' 'Tom' 1/1/11 N '1/1/11 Tim Tom'
'Bill' 'Bob' 7/12/16 N '7/12/16 Bill Bob'
I used the following code:
def sort_and_squash(a,b,c):
return ''.join(sorted([str(a),str(b),str(c)]))
matches_df = matches_df.assign(game_id = lambda x: sort_and_squash(x.name,x.opponent,x.date))
However, this did not work as intended, creating a blank column in the data frame.
I am looking for help either in finding the error in my code for the intermediate step or recommending an alternative approach.