I am coding in Python and my problem is the following:

I am trying to write regexes for products that are named in a specific way. Let's say product 1 is ABC followed by alphanumeric characters or symbols (e.g. ABC123-xyz, ABC123def) and product 2 is AB followed by alphanumeric characters or symbols (e.g. AB123xY-z, AB123deF). I need to retrieve the full name but also the root. The root is case-sensitive; otherwise I could have used the re.IGNORECASE flag.

My first attempt was, for the full name:

r"\bAB[^a-zA-Z\s][^,.\s]+"

and for the root:

r"\bAB"

The root pattern matches all examples of product 2, but also the examples of product 1 (since AB is a prefix of ABC), outputting AB in all cases.

The solution I have found is the following for the full text:

r"\b(?:AB(?!C))[^a-zA-Z\s][^,.\s]+"

For the root:

r"\b(?:AB(?!C))"

This enabled me to match the two products distinctly.
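As a quick check of the patterns above, here is a minimal sketch using re.findall. The sample text and variable names are made up for illustration; only the three patterns come from the question.

```python
import re

# Hypothetical sample text containing both product types.
text = "ABC123-xyz, ABC123def, AB123xY-z, AB123deF."

naive_root = re.compile(r"\bAB")            # first attempt for the root
root_ab = re.compile(r"\b(?:AB(?!C))")      # root, rejecting ABC
full_ab = re.compile(r"\b(?:AB(?!C))[^a-zA-Z\s][^,.\s]+")  # full name

# The naive root also fires on the ABC products.
naive_roots = naive_root.findall(text)
# The lookahead version fires only on the AB products.
roots = root_ab.findall(text)
full_names = full_ab.findall(text)

print(naive_roots)  # ['AB', 'AB', 'AB', 'AB']
print(roots)        # ['AB', 'AB']
print(full_names)   # ['AB123xY-z', 'AB123deF']
```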

The ?: makes the group non-capturing (https://stackoverflow.com/a/11530881/14682360). With a capturing group instead, functions such as re.findall would return only the captured "AB" rather than the full match.
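The difference is easiest to see with re.findall, which returns the group contents when the pattern has exactly one capturing group. The sketch below assumes findall is the function in use (the question does not say) and uses made-up sample text. It also shows that dropping the group entirely gives the same result here.

```python
import re

# Hypothetical sample text with two product 2 names.
text = "AB123xY-z, AB123deF"

# One capturing group: findall returns the group, not the whole match.
with_capture = re.findall(r"\b(AB(?!C))[^a-zA-Z\s][^,.\s]+", text)
# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r"\b(?:AB(?!C))[^a-zA-Z\s][^,.\s]+", text)
# No group at all: equivalent to the non-capturing version for this pattern.
no_group = re.findall(r"\bAB(?!C)[^a-zA-Z\s][^,.\s]+", text)

print(with_capture)     # ['AB', 'AB']
print(without_capture)  # ['AB123xY-z', 'AB123deF']
print(no_group)         # ['AB123xY-z', 'AB123deF']
```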

The (?!C) is a negative lookahead, which ensures that "AB" is not immediately followed by "C". In my own use, I listed every character that would make two product roots overlap (e.g. AB and ABC, DE and DEF, etc.).

The [^a-zA-Z\s][^,.\s]+ part first matches a digit or a symbol (in my case), then continues until it reaches a terminating character: a comma, a full stop or whitespace.

This being said, I am sure that there are better ways of doing it.

Wiktor Stribiżew
  • What is the question here? You don't need the non-capturing group, as by itself it currently has no purpose. Also, in the pattern `\b(?:AB(?!C))[^a-zA-Z\s][^,.\s]+` the negative lookahead can be omitted, as the following negated character class `[^a-zA-Z\s]` matches any character other than a letter or whitespace, which already excludes a `C`. – The fourth bird Dec 10 '21 at 11:23
  • You are absolutely right, there is a redundancy when retrieving the full name of the item, thank you for that. The problem is in retrieving the initials. When querying for AB, I also match the ABC items, and my objective is to retrieve each item's initials once. To be fair, there is no real question, as I found a working, if unclean, solution. Just thought I'd share my initial problem and solution for anyone in a similar situation. – cadmoska Dec 10 '21 at 13:36

1 Answer

If I've understood your question correctly, the requirement is just not to match if there is a 'C' immediately after the initial 'AB'. The following would be simpler and a bit shorter than using a negative lookahead:

\bAB[^C,.\s][^,.\s]+

You can see it in action here:

https://regex101.com/r/3VPwLK/1

If you know that you will never encounter 'AB' on its own followed by a comma, period, or space, you could shorten this to:

\bAB[^C][^,.\s]+
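To illustrate why the shorter form needs that precondition, here is a small sketch comparing the two patterns on made-up input where 'AB' is directly followed by a comma. The sample text and variable names are hypothetical.

```python
import re

# Hypothetical text: a bare "AB" followed by a comma, then one product of each type.
text = "AB,x ABC123 AB123"

# Longer form: the character after AB may not be C, comma, period or whitespace.
strict_matches = re.findall(r"\bAB[^C,.\s][^,.\s]+", text)
# Shorter form: the character after AB only needs to differ from C.
short_matches = re.findall(r"\bAB[^C][^,.\s]+", text)

print(strict_matches)  # ['AB123']
print(short_matches)   # ['AB,x', 'AB123']
```

The shorter pattern lets the comma through as the character after AB, so it is only safe when the input never contains such a sequence.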

This matches the full text, but you could use capturing groups to get only the root or only the part after the root. Unless I'm missing something, though, isn't the root always just AB for product 2?

Capture root only: \b(AB)[^C,.\s][^,.\s]+

Capture part after root only: \bAB([^C,.\s][^,.\s]+)
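The two captures can also be combined in a single pattern. The sketch below is one possible arrangement rather than part of the patterns above: it puts the root in group 1 and the remainder in group 2, and reads them with re.finditer. The sample text is made up.

```python
import re

# Hypothetical sample text with one product 1 name and two product 2 names.
text = "ABC123-xyz, AB123xY-z, AB123deF"

# Group 1 is the root, group 2 is the part after the root.
pattern = re.compile(r"\b(AB)([^C,.\s][^,.\s]+)")

# Collect (full match, root, remainder) for every hit.
parts = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(text)]

for full, root, rest in parts:
    print(full, root, rest)
# AB123xY-z AB 123xY-z
# AB123deF AB 123deF
```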

The following would match both product types, with the root in the first capturing group:

\b(ABC?)[^,.\s]+
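A short sketch of this combined pattern on made-up text, pairing each root with its full name. The variable names are hypothetical.

```python
import re

# Hypothetical sample text with one product of each type.
text = "ABC123-xyz, AB123xY-z"

pattern = re.compile(r"\b(ABC?)[^,.\s]+")

# Pair each root (group 1) with the full match (group 0).
pairs = [(m.group(1), m.group(0)) for m in pattern.finditer(text)]
print(pairs)  # [('ABC', 'ABC123-xyz'), ('AB', 'AB123xY-z')]

# Edge case: a bare "ABC" with nothing after it. The optional C is given
# back so that [^,.\s]+ can match something, and the root comes out as "AB".
edge = pattern.search("ABC,")
print(edge.group(1), edge.group(0))  # AB ABC
```

The edge case only matters if a root can appear with no suffix; with names like the examples in the question it does not arise.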
ljdyer