1

I have a dataframe with the following 3 columns

  ID       Department      Number
---------------------------------
2324              Art           4
2324             Math           1
2324              Art           3
2400          Science           2
2593             Tech           5
2593             Math           1

I'm trying to filter first by ID, then by Number. As you can see, some IDs repeat. I want to first find the IDs that repeat, then choose the highest corresponding Number.

For instance, as you can see the ID 2324 repeats 3 times. (4, 1, 3) are the Numbers that correspond with each entry of 2324. Since 4 is the bigger number, I choose the entry with 4. I want to filter the dataframe to get this output:

   ID       Department      Number
---------------------------------
2324              Art           4
2400          Science           2
2593             Tech           5

This is my code so far :

for previous, current in zip(df['ID'], df['ID'][1:]):
   for i, j in zip(df['Number'], df['Number'][1:]):
     if previous == current:
        if j> i:

However, I don't know what I should add next to correctly print(previous, j). If I add print(previous, j) to the nested loop, I get repeated entries. For instance, if my code was

for previous, current in zip(df['ID'], df['ID'][1:]):
   for i, j in zip(df['Number'], df['Number'][1:]):
     if previous == current:
        if j> i:
          print(previous, j)

it outputs

2324       4
2324       4
2324       4
2324       4
2324       4
2324       4        
2593       5   
2593       5
2593       5   
2593       5 

I want it to output:

2324       4     
2593       5   

I also want it to include the ID that was not repeated, so 2400 2 in this case. Additionally, I don't know how to append the correct Department name to the nested loop.

Thank you to anyone who took the time to read this and help me out. I really appreciate it.

shorttriptomars
  • 325
  • 1
  • 9
  • This one is probably the exact solution you are looking for [Python Pandas Dataframe select row by max value in group](https://stackoverflow.com/questions/32459325/python-pandas-dataframe-select-row-by-max-value-in-group) – ThePyGuy Sep 14 '21 at 05:20
  • You don't need loop to achieve what you want, you can first sort the values of your Number column and then use the drop_duplciates, like this: `df.sort_values(by=['num'], ascending=False)` followed by `df.drop_duplicates("id", inplace=True)`, by default, the drop_duplicates will keep the first entry. – user2906838 Sep 14 '21 at 05:24

1 Answers1

1

Use:

df.loc[df.groupby('ID')['Number'].idxmax()]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
U13-Forward
  • 69,221
  • 14
  • 89
  • 114