0

I have a dataframe which I named parking which has multiple columns, in this case Registration State, Violation Code, and Summons Number.

For each Registration State, I want the 3 Violation Codes which the highest row count. The best I've been able to get is:

parking_state_group = parking.groupby(['Registration State', 'Violation Code'])['Summons Number'].count()

When printed (i.e. print(parking_state_group.reset_index()) looks like:

     Registration State  Violation Code  Summons Number
0                    99               0              14
1                    99               6               1
2                    99              10               6
3                    99              13               2
4                    99              14              75
...                 ...             ...             ...
1811                 WY              37               3
1812                 WY              38               4
1813                 WY              40               4
1814                 WY              46               1
1815                 WY              68               1

This at least gets me the count of each Violation Code for each state (Summons Number is like an ID field for each row). I want this to return only the 3 violation codes for each state with the highest count, so something like:

      Registration State  Violation Code  Summons Number
0                    99               14             75
1                    99               31             61
2                    99               87             55
...                 ...             ...             ...
1812                 WY              38               4
1813                 WY              40               4
1811                 WY              37               3

I've tried .nlargest() but this doesn't seem to get the largest .count(), only the largest values within a column, which isn't what I'm looking for.

  • Have you checked nlargest? – Vaishali Aug 21 '20 at 23:16
  • Malcolm, can you create a reproducible example? There is data in the output, which is not in the input, so it cannot be reproduced: Hope this helps: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – David Erickson Aug 21 '20 at 23:18
  • @DavidErickson Appreciate the link. Let's do this as an example: import pandas as pd df = pd.DataFrame({'state': ['NY', 'CA', 'TX', 'NY', 'CA', 'TX', 'NY', 'CA', 'TX', 'TX', 'NY' ,'CA', 'NY', 'CA', 'TX'], 'violation_code': ['31', '25', '33', '31', '25', '33', '23', '43', '01', '31', '25', '33', '31', '25', '33'], 'ticket_id': [001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015]}) For each state, I want the 2 violation codes that show up the most, as well as how many times that violation code appears for each state – Malcolm007 Aug 22 '20 at 12:13

1 Answers1

0

Lets try

df[['Registration State', 'Violation Code', 'Summons Number']].groupby('Registration State')['Summons Number'].nlargest(3).reset_index().rename(columns={'level_1':'Violation Code'})
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • This didn't bring the count of violation codes, but for each state it did bring 3 summons numbers. Not really sure what the level_1 column represents, but the idea is that I want the Registration State column, and then in the Violation Code column, the respective violation code, and then in the Summons Number column the count of rows (using Summons Number) for that respective state and violation code. It would be the top 3 violation codes in each state – Malcolm007 Aug 22 '20 at 12:18