I'm trying to create a new column and fill it with a value ('company') if the value in another column matches one of the patterns in the regex below:

"INC|INC$|INC$|LTD$|CORP$|CORPORATION$|COMPANY$|LLC$|\*LLC$|\*,INC$|\*,CORP$|\*LTD$|\*CORP$|LEASING|TRANSPORTATION|CONSULTANTS|SERVICES|INCORPORATED"

Here is what I tried:

patterns = [".INC.","INC$", ",INC$","LTD$", "CORP$", "CORPORATION$", "COMPANY$", "LLC$", ".*([a-zA-Z]+)LLC$", ".*([a-zA-Z]+),INC$", ".*([a-zA-Z]+),CORP$", ".*([a-zA-Z]+)LTD$", ".*([a-zA-Z]+)CORP$", "LEASING", "TRANSPORTATION", "CONSULTANTS", "SERVICES", "INCORPORATED"]

patterns = re.compile('|'.join(patterns))

data.loc[data['OwnerName'].str.contains(patterns), 'owner'] = 'company'

It matches and renames some strings but not the others. For instance: xxx,INC is matched but xxx INC is not matched.

Could you please point out what I am doing wrong? Thanks!

The xxx, INC string should turn into company if matched. But it does not.

Wiktor Stribiżew
Nikita Voevodin

1 Answer

To match optional trailing whitespace, you can add \s* before $.
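The effect of a trailing space can be checked with a minimal sketch that uses only the standard re module. The sample owner names below are made up for illustration and are not taken from the asker's data.

```python
import re

# Hypothetical sample values; the last one ends with a trailing space.
names = ["xxx,INC", "xxx INC", "xxx INC "]

strict = re.compile(r"INC$")       # INC must be the very last thing in the string
lenient = re.compile(r"INC\s*$")   # INC may be followed by whitespace

strict_hits = [bool(strict.search(n)) for n in names]
lenient_hits = [bool(lenient.search(n)) for n in names]

print(strict_hits)   # [True, True, False]
print(lenient_hits)  # [True, True, True]
```

If the real column contains values with trailing spaces, only the lenient form will flag them.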

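Also, some alternatives in the regex you provided are redundant. Because str.contains searches anywhere in the string, a plain INC already covers INC$, ,INC$ and INCORPORATED, and the alternatives with a leading .*([a-zA-Z]+) match nothing that their shorter counterparts do not. You can greatly shorten the pattern if you use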

import re

# data is the DataFrame from the question
patterns = ["INC",r"LTD\s*$",r"CORP\s*$",r"CORPORATION\s*$",r"COMPANY\s*$",r"LLC\s*$","LEASING","TRANSPORTATION","CONSULTANTS","SERVICES"]
patterns = re.compile('|'.join(patterns))
data.loc[data['OwnerName'].str.contains(patterns), 'owner'] = 'company'

Use raw string literals (r"...") when a pattern contains a literal backslash, such as \s. Otherwise Python warns about an invalid escape sequence.
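As a rough check of the redundancy claim, the sketch below compares the INC-related alternatives from the question with a plain INC. The sample names are hypothetical, and only the standard re module is used.

```python
import re

# INC-related alternatives from the question, and the single one that replaces them.
inc_variants = re.compile("|".join([".INC.", "INC$", ",INC$", "INCORPORATED"]))
inc_only = re.compile("INC")

# Hypothetical sample values for illustration.
names = ["ACME,INC", "ACME INC", "ACME INC CO", "ACME INCORPORATED", "JOHN SMITH"]

for name in names:
    # Whenever any of the longer variants matches, plain INC matches as well.
    if inc_variants.search(name):
        assert inc_only.search(name)

matched = [n for n in names if inc_only.search(n)]
print(matched)  # ['ACME,INC', 'ACME INC', 'ACME INC CO', 'ACME INCORPORATED']
```

Note that an unanchored INC also matches inside longer words (for example VINCENT). If that is not intended, a word-boundary form such as r"\bINC\b" may be a better fit, but that is an assumption about the data rather than something stated in the question.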

Wiktor Stribiżew