How to remove and filter repeated entries in dataframe

Question

I have a dataframe with the following 3 columns:

  ID       Department      Number
---------------------------------
2324              Art           4
2324             Math           1
2324              Art           3
2400          Science           2
2593             Tech           5
2593             Math           1

I'm trying to filter first by ID, then by Number. As you can see, some IDs repeat. I want to first find the IDs that repeat, then choose the highest corresponding Number.

For instance, the ID 2324 appears 3 times, with the Numbers (4, 1, 3). Since 4 is the largest, I want to keep the entry with 4. I want to filter the dataframe to get this output:

   ID       Department      Number
---------------------------------
2324              Art           4
2400          Science           2
2593             Tech           5

This is my code so far:

for previous, current in zip(df['ID'], df['ID'][1:]):
   for i, j in zip(df['Number'], df['Number'][1:]):
     if previous == current:
        if j> i:

However, I don't know what I should add next to correctly print(previous, j). If I add print(previous, j) to the nested loop, I get repeated entries. For instance, if my code was

for previous, current in zip(df['ID'], df['ID'][1:]):
   for i, j in zip(df['Number'], df['Number'][1:]):
     if previous == current:
        if j> i:
          print(previous, j)

it outputs

2324       4
2324       4
2324       4
2324       4
2324       4
2324       4        
2593       5   
2593       5
2593       5   
2593       5

I want it to output:

2324       4     
2593       5

I also want it to include the ID that was not repeated, so 2400 2 in this case. Additionally, I don't know how to append the correct Department name to the nested loop.

Thank you to anyone who took the time to read this and help me out. I really appreciate it.

This one is probably the exact solution you are looking for: [Python Pandas Dataframe select row by max value in group](https://stackoverflow.com/questions/32459325/python-pandas-dataframe-select-row-by-max-value-in-group) – ThePyGuy, Sep 14 '21 at 05:20
You don't need a loop to achieve what you want. You can first sort by the Number column and then use drop_duplicates, like this: `df = df.sort_values(by=['Number'], ascending=False)` followed by `df.drop_duplicates('ID', inplace=True)`. By default, drop_duplicates keeps the first entry for each ID. – user2906838, Sep 14 '21 at 05:24
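
A minimal sketch of the sort-then-drop approach from the comment above, rebuilt on the sample data in the question. The variable name `result` and the final `sort_values("ID")` / `reset_index` steps are additions for readability and are not part of the original comment.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [2324, 2324, 2324, 2400, 2593, 2593],
    "Department": ["Art", "Math", "Art", "Science", "Tech", "Math"],
    "Number": [4, 1, 3, 2, 5, 1],
})

# sort_values returns a new frame, so the result has to be kept or chained
result = (
    df.sort_values("Number", ascending=False)  # largest Number first
      .drop_duplicates("ID")                   # keep="first" is the default
      .sort_values("ID")                       # restore ID order (optional)
      .reset_index(drop=True)
)
print(result)
```

Because whole rows are kept, the Department stays attached to the Number it belongs to.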

score 1 · Accepted Answer · edited Sep 14 '21 at 05:59

Use:

df.loc[df.groupby('ID')['Number'].idxmax()]
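
For reference, here is the same one-liner as a self-contained sketch on the sample data from the question. The intermediate names `idx` and `result` are only for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [2324, 2324, 2324, 2400, 2593, 2593],
    "Department": ["Art", "Math", "Art", "Science", "Tech", "Math"],
    "Number": [4, 1, 3, 2, 5, 1],
})

# For each ID, idxmax gives the index label of the row holding the largest Number
idx = df.groupby("ID")["Number"].idxmax()

# Selecting those labels with .loc keeps the full original rows,
# so Department comes from the same row as the maximum Number
result = df.loc[idx]
print(result)
```

IDs that appear only once (2400 here) form a group of one row, so they are included automatically.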

answered Sep 14 '21 at 05:15 by U13-Forward · edited Sep 14 '21 at 05:59 by jezrael

It is dupe `loc + groupby + idxmax` – jezrael Sep 14 '21 at 05:19
@jezrael wikied – U13-Forward Sep 14 '21 at 05:19
@U12-Forward actually, one more question, suppose the 'Department' isn't first? How do I factor that in? – shorttriptomars Sep 14 '21 at 05:24
@shorttriptomars Maybe `df.groupby('ID', as_index=False).max()` – U13-Forward Sep 14 '21 at 05:26
This is wrong. Worked only since each group's max is in the first row... – Chris Sep 14 '21 at 05:28
@U12-Forward I'm sorry that doesn't work. It still doesn't factor in if the corresponding Dept is not listed first – shorttriptomars Sep 14 '21 at 05:28
@shorttriptomars Could you please explain again? – U13-Forward Sep 14 '21 at 05:33
@U12-Forward in the dataframe, say instead of `(Tech 5), (Math 1)` it's listed as `(Tech 1), (Math 5)` . The code you provided would still print Tech for Department, instead of changing it to Math – shorttriptomars Sep 14 '21 at 05:37
@shorttriptomars Then use `df.groupby('ID', as_index=False).max()` – U13-Forward Sep 14 '21 at 05:41
@U12-Forward I'm sorry that doesn't work. – shorttriptomars Sep 14 '21 at 05:45
@shorttriptomars - You need `idxmax` for corresponding row, check dupe for correct answer. – jezrael Sep 14 '21 at 05:50
@jezrael can you please explain how? When I add `idmax` I get the error `'DataFrame' object has no attribute 'idmax'` – shorttriptomars Sep 14 '21 at 05:52
@shorttriptomars - `df.loc[df.groupby('ID')['Number'].idxmax()]` – jezrael Sep 14 '21 at 05:53
@shorttriptomars - Convert `Number` to numeric. – jezrael Sep 14 '21 at 05:58
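
The comment thread above touches on two points that a short sketch may make clearer: converting `Number` to a numeric dtype, and why `groupby(...).max()` can return the wrong Department. The data below uses the swapped `(Tech 1), (Math 5)` case from the comments. Storing `Number` as strings is an assumption made here to mimic a column that was read in as text; the names `per_column` and `rows` are illustrative.

```python
import pandas as pd

# Swapped case from the comments: (Tech, 1), (Math, 5).
# Number is stored as strings here (an assumption for illustration).
df = pd.DataFrame({
    "ID": [2324, 2324, 2324, 2400, 2593, 2593],
    "Department": ["Art", "Math", "Art", "Science", "Tech", "Math"],
    "Number": ["4", "1", "3", "2", "1", "5"],
})

# Convert Number to a numeric dtype before looking for the maximum
df["Number"] = pd.to_numeric(df["Number"])

# max() is computed per column, so Department and Number can come
# from different rows: ID 2593 ends up as ("Tech", 5), which is not a real row
per_column = df.groupby("ID", as_index=False).max()

# idxmax picks one row label per ID, so the whole row stays together
rows = df.loc[df.groupby("ID")["Number"].idxmax()]

print(per_column)
print(rows)
```

With `max()`, the Department is simply the alphabetically largest name in each group, which is why it appeared to work only when that name happened to sit on the row with the largest Number.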
