How to remove conceptual duplicates?

Question

I have to deal with a dataset similar to this one:

    Player  CurrentTeam  PreviousTeam
0   James   Team A       Null
1   David   Team B       Null
2   George  Team C       Null
3   James   Team B       Team A
4   James   Team C       Team B
... ...

In the above example, James is in fact the same player and appears 3 times in the dataset. There is no ID or such a thing.

What happens here is that James is in fact playing for Team C. Before Team C he was in Team B and even before he was in Team A (where his career started).

However, I only need to know about his real current team (i.e. Team C).

Can I get rid of all the other occurences of James' past teams?

You can't, in general. You could have a cycle in which case it's impossible to tell what the current team is. Unless the database is ordered in terms of dates of transfers, in which case you can just look at the last entry. — orlp, Nov 05 '20 at 08:57
@orlp I guess this could happen in case he decides to play for a team twitce. What if he could only play in a team only once and thus, never go back? — , Nov 05 '20 at 08:58
@orlp We can guarantee it in the certain example. Would I have to write vanilla Python in order to find the solution though? Or is there a pandas-like way to do it? — , Nov 05 '20 at 09:01
If the entries are ondered in terms of transfers, you can simply do df.drop_duplicates(['Player'], keep='last') to get the rows with the current latest team of each player — Colle, Nov 05 '20 at 09:04
@Manakin There are dates, but they refer to the dates that each record was written to the database rather than the date of the transfer. — , Nov 05 '20 at 09:40
then you just need to take the record where the date field is the max per `Player` right? — Umar.H, Nov 05 '20 at 09:42
@Manakin I wish that was the case. Some transfers have been registered the same date (for the same player). — , Nov 05 '20 at 09:44
@kakarito I assume there is no timestamp with the date? then you could the max row number per player I think Ansve's solution will work in this case. — Umar.H, Nov 05 '20 at 10:22

score 1 · Accepted Answer · answered Nov 05 '20 at 09:19

Based on the technique in this answer we can get rid of any (player, current_team) combo that exists as a (player, previousteam) combo.

>>> df
   Player CurrentTeam PreviousTeam
0   James      Team A         Null
1   David      Team B         Null
2  George      Team C         Null
3   James      Team B       Team A
4   James      Team C       Team B

>>> old_teams = df[["Player", "PreviousTeam"]].rename(columns={"PreviousTeam": "Team"})
>>> all_teams = df[["Player", "CurrentTeam"]].rename(columns={"CurrentTeam": "Team"})
>>> current_teams = (pd.merge(all_teams, old_teams, indicator=True, how="outer")
                       .query("_merge == 'left_only'")
                       .drop("_merge", axis=1)
                       .reset_index(drop=True))
>>> current_teams
   Player    Team
0   David  Team B
1  George  Team C
2   James  Team C

ansev · Answer 2 · 2020-11-05T09:53:14.567

1

Use DataFrame.drop_duplicates

new_df = df.drop_duplicates('Player', keep='last')
print(new_df)

If you want to check PreviousTeam use:

df.loc[~df.groupby('Player')['PreviousTeam']
          .shift(-1)
          .eq(df['CurrentTeam'])]

Output

   Player CurrentTeam PreviousTeam
1   David      Team B         Null
2  George      Team C         Null
4   James      Team C       Team B

edited Nov 05 '20 at 09:53

answered Nov 05 '20 at 09:26

ansev

30,322
5
17
31

This only works if the assumption that transfers are ordered in increasing date holds true. You should include that assumption in your answer. – orlp Nov 05 '20 at 09:33
OP said : *There is no ID or such a thing. ---- However, I only need to know about his real current team (i.e. Team C).* if OP doesn't use time series to sort the data I think this can be understood, but I understand your point of view :) – ansev Nov 05 '20 at 09:47

Grayrigel · Answer 3 · 2020-11-05T10:13:20.210

0

One possible solution for this case is to use groupby with tail:

df = df.groupby('Player').tail(1)
print(df)

Output:

    Player CurrentTeam PreviousTeam
1   David       TeamB         Null
2  George       TeamC         Null
4   James       TeamC        TeamB

edited Nov 05 '20 at 10:13

answered Nov 05 '20 at 09:42

Grayrigel

3,474
5
14
32

How to remove conceptual duplicates?

3 Answers3