I am coding in Python and my problem is the following:
I am trying to write regexes that match products described in a specific way. Let's say product 1 is ABC + alphanumeric characters or symbols (e.g. ABC123-xyz, ABC123def) and product 2 is AB + alphanumeric characters or symbols (e.g. AB123xY-z, AB123deF). I need to retrieve the full name but also the root. The root is case-sensitive; otherwise I could have used the re.IGNORECASE flag.
My first attempt was, for the full name:
r"\bAB[^a-zA-Z\s][^,.\s]+"
and for the root:
r"\bAB"
The root pattern would match all examples of product 2, but also the examples of product 1 (since AB is contained in ABC), outputting AB in all cases.
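To make the problem concrete, here is a minimal sketch that runs the first-attempt patterns over the four example names from above. The variable names and the sample string are only for illustration.

```python
import re

# The four example product names, separated by commas.
text = "ABC123-xyz, ABC123def, AB123xY-z, AB123deF"

full_name = re.compile(r"\bAB[^a-zA-Z\s][^,.\s]+")
root = re.compile(r"\bAB")

# The full-name pattern already skips ABC products, because "C" is a letter
# and [^a-zA-Z\s] refuses to match it.
full_matches = full_name.findall(text)
print(full_matches)  # ['AB123xY-z', 'AB123deF']

# The root pattern, however, fires on every name, including the ABC ones.
root_matches = root.findall(text)
print(root_matches)  # ['AB', 'AB', 'AB', 'AB']
```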
The solution I found is the following. For the full name:
r"\b(?:AB(?!C))[^a-zA-Z\s][^,.\s]+"
For the root:
r"\b(?:AB(?!C))"
This enabled me to match the two products distinctly.
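A quick check of the lookahead version on the same example names. The `abc_root` pattern for product 1 is my own assumption for the sake of the comparison; it is simply the longer root with no lookahead.

```python
import re

text = "ABC123-xyz, ABC123def, AB123xY-z, AB123deF"

# Product 2 (AB): the negative lookahead rejects "AB" when "C" follows.
ab_full = re.compile(r"\b(?:AB(?!C))[^a-zA-Z\s][^,.\s]+")
ab_root = re.compile(r"\b(?:AB(?!C))")

# Product 1 (ABC): hypothetical counterpart, not part of the original patterns.
abc_root = re.compile(r"\bABC")

ab_full_matches = ab_full.findall(text)
ab_root_matches = ab_root.findall(text)
abc_root_matches = abc_root.findall(text)

print(ab_full_matches)   # ['AB123xY-z', 'AB123deF']
print(ab_root_matches)   # ['AB', 'AB']
print(abc_root_matches)  # ['ABC', 'ABC']
```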
The use of ?:
makes the group non-capturing (https://stackoverflow.com/a/11530881/14682360); without it, the match would return only "AB" as a group.
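The difference shows up with `re.findall`, which returns the group contents instead of the whole match as soon as the pattern contains a capturing group. The sketch below also shows one possible way to get the root and the full name in a single pass with `re.finditer`; that part is a suggestion, not something I used originally.

```python
import re

text = "ABC123-xyz, ABC123def, AB123xY-z, AB123deF"

non_capturing = r"\b(?:AB(?!C))[^a-zA-Z\s][^,.\s]+"
capturing = r"\b(AB(?!C))[^a-zA-Z\s][^,.\s]+"

# No capturing group: findall returns the whole match.
full_names = re.findall(non_capturing, text)
print(full_names)  # ['AB123xY-z', 'AB123deF']

# One capturing group: findall returns only what the group captured.
roots_only = re.findall(capturing, text)
print(roots_only)  # ['AB', 'AB']

# With match objects, group(1) is the root and group(0) the full name.
pairs = [(m.group(1), m.group(0)) for m in re.finditer(capturing, text)]
print(pairs)  # [('AB', 'AB123xY-z'), ('AB', 'AB123deF')]
```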
The use of (?!C)
is a negative lookahead, which ensures that "C" does not immediately follow the matched "AB". For my own use, I listed every character that would make two product roots overlap (e.g. AB and ABC, DE and DEF, etc.).
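Listing the overlapping characters by hand gets tedious with many roots. Below is a rough sketch of how the lookahead could be derived from a list of roots instead. The helper `root_pattern` and the `roots` list are hypothetical names I made up for this example, and it assumes roots only clash when one is a prefix of another.

```python
import re

def root_pattern(root, all_roots):
    """Build a root regex that refuses to match inside a longer root."""
    # Whatever a longer root adds after this one, e.g. "C" for AB vs ABC.
    suffixes = [other[len(root):] for other in all_roots
                if other != root and other.startswith(root)]
    lookahead = ""
    if suffixes:
        alternatives = "|".join(re.escape(s) for s in suffixes)
        lookahead = f"(?!{alternatives})"
    return rf"\b(?:{re.escape(root)}{lookahead})"

roots = ["AB", "ABC", "DE", "DEF"]  # illustrative list of product roots

ab_pattern = root_pattern("AB", roots)
print(ab_pattern)  # \b(?:AB(?!C))

sample = "ABC1x, AB2y, DEF3z, DE4w"
de_matches = re.findall(root_pattern("DE", roots), sample)
abc_matches = re.findall(root_pattern("ABC", roots), sample)
print(de_matches)   # ['DE']
print(abc_matches)  # ['ABC']
```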
The use of [^a-zA-Z\s][^,.\s]+
is to match a digit or a symbol (in my case) right after the root, and then continue up to a delimiter: a comma, a full stop or whitespace.
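A small sketch of how those two character classes behave on a few edge cases. The sample names other than AB123xY-z are invented for the test. Note that because of the `+`, at least two characters are required after the root, so very short names are skipped.

```python
import re

pattern = re.compile(r"\bAB[^a-zA-Z\s][^,.\s]+")

# AB123xY-z : matches, stops before the comma
# AB-77     : matches, a symbol right after the root is allowed
# AB9.5     : no match, the full stop comes before [^,.\s]+ can consume anything
# ABx1      : no match, a letter directly follows the root
# AB7       : no match, only one character follows the root
sample = "AB123xY-z, AB-77 AB9.5 ABx1 AB7"

matches = pattern.findall(sample)
print(matches)  # ['AB123xY-z', 'AB-77']
```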
That being said, I am sure there are better ways of doing it.