I'm cleaning up data for a personal project and am standardizing the large number of categories. The seemingly low hanging fruit have similar enough names such as:
'SUSPECIOUS CRAFT', 'SUSPECTED MILITANTS', 'SUSPECTED PIRATE','SUSPECTED TERRORISTS', 'SUSPICICIOUS APPROACH', 'SUSPICIOPUS APPROACH', 'SUSPICIOUS APPRAOCH','SUSPICIOUS APPROACH', 'SUSPICIOUS BOAT', 'SUSPICIOUS BOATS', 'SUSPICIOUS CRAFT', 'SUSPICIOUS CRAFTS', 'SUSPICIOUS VESSEL', 'SUSPICOUS APPROACH', 'SUSPICUIOUS APPROACH','SUSPIPICIOUS APPROACH', 'SUSPISIOUC CRAFT', 'SUS[ICIOUS APPROACH'
There are others, including ones with lower and mixed case, so I'm using regex. I can select what I'm looking for with this (note that I added #8619:
df[df["hostility"].str.contains(r"^Su(s|c)(p|])(i|e)", regex=True, case=False)]
year hostility victim
878 2018 Suspicious Approach Tug
7060 2001 SUSPICIOUS CRAFT MERCHANT VESSEL
7068 2001 Suspicious group onboard a trawler YACHT
7723 2000 SUSPICIOUS CRAFT MERCHANT VESSEL
8619 2004 Protest tug
10001 2003 SUSPICIOUS CRAFT MERCHANT VESSEL
But I'm stuck at replacing all the variations so that they would like like this:
year hostility victim
878 2018 Suspicious Approach Tug
7060 2001 Suspicious Approach MERCHANT VESSEL
7068 2001 Suspicious Approach YACHT
7723 2000 Suspicious Approach MERCHANT VESSEL
8619 2004 Protest tug
10001 2003 Suspicious Approach MERCHANT VESSEL
What is the most effective to do this?