
I have a dataframe in pandas in python with criminal charges and Person IDs. Some charges are listed twice for the same ID because an individual had 2 or more counts of the same charge. Some are listed twice because the charge was tried first in a district court and then moved to a superior court. Grouping by individual, I'd like to identify these charges, then drop the version from the district court.

Here is a sample dataset:

import pandas as pd

d = {'ID': [1, 1, 1, 1, 2, 2, 2, 3, 3],
     'Court': ['district', 'district', 'district', 'superior', 'district', 'district', 'superior', 'district', 'superior'],
     'Charge': ['larceny', 'larceny', 'assault', 'assault', 'oui', 'larceny', 'assault', 'assault', 'assault']}
df = pd.DataFrame(data=d)

    ID  Court       Charge
0   1   district    larceny
1   1   district    larceny
2   1   district    assault
3   1   superior    assault
4   2   district    oui
5   2   district    larceny
6   2   superior    assault
7   3   district    assault
8   3   superior    assault

I can identify duplicates by ID and Charge, but a row should only count as a duplicate to drop when it matches another row on ID and Charge and the two rows have different values for Court.

This is the result I would like:


    ID  Court       Charge
0   1   district    larceny
1   1   district    larceny
2   1   superior    assault
3   2   district    oui
4   2   district    larceny
5   2   superior    assault
6   3   superior    assault

I can identify 'duplicate' charges using duplicated(), but am not sure how to make sure the duplicates have two different values for Court before dropping. Thanks for any help!
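
For reference, this is roughly what I'm doing now (a minimal sketch): it flags rows that match on ID and Charge, but takes no account of Court:

# flag every row whose (ID, Charge) pair appears more than once
dupes = df.duplicated(subset=['ID', 'Charge'], keep=False)
df[dupes]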

Laura

1 Answer


Edit: To address different cases having the same Charge, a running number can be assigned within each (ID, Charge, Court) group:

# a hypothetical case_id
df["case_id"] = df.groupby(["ID", "Charge", "Court"]).cumcount() + 1

Then group the dataset by ["ID", "Charge", "case_id"] and use .max() to keep the alphabetically last Court ("superior" > "district"). You don't need to create an additional label column denoting superior=1 and district=0.

# aggregate: keep the alphabetically last Court per (ID, Charge, case_id)
df_out = df.groupby(["ID", "Charge", "case_id"]).max().reset_index()
del df_out["case_id"]

# result
df_out
   ID   Charge     Court
0   1  assault  superior
1   1  larceny  district
2   1  larceny  district
3   2  assault  superior
4   2  larceny  district
5   2      oui  district
6   3  assault  superior
Bill Huang
  • This eliminates a few too many -- ID 1 should still list two larceny charges, for example. Is the .agg("max") working based on alphabetic order? – Laura Oct 18 '20 at 22:32
  • Yes I took it wrong. I have revised the answer by adding a hypothetical `case_id` column grouped by `["ID", "Charge", "Court"]` to denote duplicate cases. Now the output looks correct :) – Bill Huang Oct 18 '20 at 22:46
  • I ended up creating a dummy variable for `Superior Court` since in the real dataset the court names have additional words that would interfere with alphabetical ordering. Then I used `df["final_court_superior"] = df.groupby(['ID', 'Charge', 'counter_id'])['Superior_Court'].transform('max')` to get an indicator for whether that charge ended up in superior court, then dropped any records where `final_court_superior == True` and `Superior_Court == False` (see the sketch after these comments). Thanks @Bill for your help in getting to this solution! – Laura Oct 20 '20 at 01:19
  • You're welcome! BTW, if the interfering characters are not in the front, then I guess you can groupby the first 8 characters of `Court` as shown in [this post](https://stackoverflow.com/questions/35259801). E.g. something like `df.groupby(["ID", "Charge", df.Court.str[:8]]).cumcount()` to obtain the hypothetical counter. :) – Bill Huang Oct 20 '20 at 01:42
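
For readers following along, here is a rough sketch of the indicator-column approach described in the comments above. The column names `Superior_Court`, `counter_id`, and `final_court_superior` come from Laura's comment; the substring match on "superior" is an assumption about what the real court strings contain:

# hypothetical indicator: True where the row was heard in a superior court
df["Superior_Court"] = df["Court"].str.contains("superior", case=False)

# running number so repeated counts of the same charge stay separate
df["counter_id"] = df.groupby(["ID", "Charge", "Court"]).cumcount() + 1

# True for every row of a charge that ended up in superior court
df["final_court_superior"] = (
    df.groupby(["ID", "Charge", "counter_id"])["Superior_Court"].transform("max")
)

# drop the district-court rows whose charge also appears in superior court
df_out = df[~(df["final_court_superior"] & ~df["Superior_Court"])]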