1

I've been stuck on this problem for several hours now and I'm out of ideas. Essentially, I have an A|B|C-type alternation regex and for some reason C matches instead of B, even though the individual alternatives should be tested from left to right, stopping at the first one that succeeds (i.e. once a match is found, the remaining alternatives are not tested anymore).

This is my code:

import re

text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
# Three alternatives: the exact string, the exact string after a non-word
# character, and first word + wildcard + last word.
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"
                    + expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])

m = re_exp.search(text)
print(m.group(0))

I want the regex to find the "expansion" string. In my dataset, the text sometimes contains a slightly edited version of the expansion string, for example with articles or prepositions like "for" or "the" between the main nouns. This is why I first try to match the string as is, then try to match it when it follows any non-word character (i.e. a parenthesis or, as in the example above, a slash where the space was omitted), and finally go full wildcard and search for the first and last word of the string with anything in between.

Either way, with the example above I would expect to get the following output:

American Heart Association

but what I'm getting is

American College of Cardiology (ACC)/American Heart Association

which is the match for the final regex.

If I delete the final regex or just call re.findall(r"(?<=\W)"+ expansion, text), I get the output I want, meaning the regex is in fact matching properly.

What gives?

  • 2
    This is your pattern `American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association` See https://regex101.com/r/vzN62o/1. The regex goes from left to right, and the alternatives are tried from left to right. The first 2 alternatives do not match when the first occurrence of the word `American` is encountered, the third alternative does, that is why you get that match – The fourth bird Mar 17 '21 at 20:55
  • 1
    Instead of using `.*?` you could for example not allow to match specified characters in between `\bAmerican\b[^/,.]*\bAssociation\b` https://regex101.com/r/Iz83Xb/1 or a tempered greedy token like `\bAmerican\b(?:(?!American|Association).)*\bHeart Association` https://regex101.com/r/kiR7xx/1 – The fourth bird Mar 17 '21 at 21:05

2 Answers

1

So re.findall(r"(?<=\W)"+ expansion, text) works because the character before the match, "/", is a non-word character (matched by \W). Your combined regex, however, will also match "American [whatever random stuff here] Association". Because the search moves through the text from left to right, you match "American College of Cardiology (ACC)/American Heart Association", which starts earlier, before you ever reach the inner string "American Heart Association". E.g. if you deleted the first "American" in your string, you would get the match you are looking for with your regex.

You need to be more restrictive with your regex to rule out situations like these.
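The effect of deleting the first "American" can be checked with a small sketch. The text here is a shortened version of the sentence from the question (shortened only for brevity), and the variable names are illustrative:

```python
import re

# Shortened version of the question's sentence (an assumption made for brevity).
text = ("stage D of the ABCD classification of the American College of "
        "Cardiology (ACC)/American Heart Association (AHA)")

# The combined pattern from the question, written out literally.
pattern = re.compile(
    r"American Heart Association"
    r"|(?<=\W)American Heart Association"
    r"|American[-\s].*?\s*?Association"
)

original = pattern.search(text).group(0)

# Remove only the first "American " so that no alternative can start earlier
# than the string we actually want.
edited = pattern.search(text.replace("American ", "", 1)).group(0)

print(original)
print(edited)
```

With the first "American" in place, the wildcard alternative wins because it can start earlier in the text; without it, the exact alternative is the first one that can match anywhere.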

itwasthekix
1

The regex looks like this:

American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association

The first 2 alternatives match the same text, only the second one has a positive lookbehind prepended.

You can omit that second alternative: wherever it could match, the first alternative (which has no assertions) has already matched, and wherever the first one does not match, the second one will not match either.

The pattern is applied from left to right through the text, so it first reaches the first occurrence of American. At that position the first and the second alternatives cannot match, because the text continues with College of Cardiology.

The third alternative can match there, and due to the .*? it stretches until the first occurrence of Association.
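To see why the position in the text matters more than the order of the alternatives, here is a minimal sketch that searches with each alternative on its own and compares the start offsets. The text is a shortened version of the question's sentence, and the variable names are illustrative:

```python
import re

# Shortened version of the question's sentence (an assumption made for brevity).
text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA)")

alternatives = [
    r"American Heart Association",
    r"(?<=\W)American Heart Association",
    r"American[-\s].*?\s*?Association",
]

# Start offset of the earliest match of each alternative taken on its own.
starts = [re.search(alt, text).start() for alt in alternatives]

# The combined pattern reports the match at the smallest start offset;
# the order of alternatives only breaks ties at the same position.
combined = re.search("|".join(alternatives), text)

print(starts)
print(combined.start(), combined.group(0))
```

The first two alternatives can only start at the second "American", while the third can already start at the first one, so the combined search returns the third alternative's match.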


What you might do is for example exclude possible characters to match using a negated character class:

\bAmerican\b[^/,.]*\bAssociation\b
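A quick check of this pattern in Python might look as follows. The first text is a shortened version of the question's sentence; the second string is a made-up variant with extra words between the nouns, used only for illustration:

```python
import re

pattern = re.compile(r"\bAmerican\b[^/,.]*\bAssociation\b")

# Shortened version of the question's sentence (an assumption made for brevity).
text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA)")
# Hypothetical variant with additional words inside the name.
variant = "funded by the American Heart and Stroke Association"

# The "/" cannot be crossed, so a match starting at the first "American" fails.
first = pattern.search(text).group(0)
second = pattern.search(variant).group(0)

print(first)
print(second)
```

Which characters go into the negated class depends on the data; `/`, `,` and `.` are the ones suggested here, and other separators may need to be added.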


Or you might use a tempered greedy token approach to not allow specific words between the first and last part:

\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
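The same kind of check for the tempered greedy token, again as a sketch with a shortened version of the question's sentence and a made-up variant string:

```python
import re

pattern = re.compile(
    r"\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b"
)

# Shortened version of the question's sentence (an assumption made for brevity).
text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA)")
# Hypothetical variant with additional words before "Heart Association".
variant = "the American and European Heart Association"

# The lookahead stops the repetition before a second "American", so a match
# cannot span from the first "American" across the second one.
first = pattern.search(text).group(0)
second = pattern.search(variant).group(0)

print(first)
print(second)
```

Note that this version requires the literal words "Heart Association" at the end, so it only tolerates extra words right after "American".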


The fourth bird
  • 1
    Thank you for the great explanation and suggestions! I thought regex would go through the whole sequence testing each `A|B|C` alternative in turn. Guess I was wrong. While I couldn't quite figure out a regex that worked for all the sentences in the set (had to get the data out quickly), I simply went over each sentence in the data with a `for` loop, progressively loosening the regex conditions (essentially doing what I thought `A|B` does). This allowed me to cover all the examples in the set I needed without giving me invalid matches like the one shown above. – Chris Nagai Mar 19 '21 at 11:32
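
The loop described in this comment could look roughly like the sketch below. The actual code was not posted, so the function name `find_expansion`, the choice of two patterns, and the use of `re.escape` are all assumptions:

```python
import re

def find_expansion(text, expansion):
    """Try the strictest pattern over the whole text first, then looser ones.

    Hypothetical helper; the patterns are assumptions, not the original code.
    """
    words = expansion.split()
    patterns = [
        # Strict: the expansion exactly as given.
        re.escape(expansion),
        # Loose: first word, anything in between, last word.
        re.escape(words[0]) + r"[-\s].*?" + re.escape(words[-1]),
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None

# Shortened version of the question's sentence (an assumption made for brevity).
text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA)")
strict_hit = find_expansion(text, "American Heart Association")
# Hypothetical sentence where only the loose pattern can match.
loose_hit = find_expansion("the American Stroke and Heart Association",
                           "American Heart Association")
no_hit = find_expansion("no organisation is named here",
                        "American Heart Association")

print(strict_hit, loose_hit, no_hit, sep="\n")
```

Unlike a single `A|B|C` pattern, each pattern here scans the whole text before the next, looser one is tried, so a strict match anywhere in the sentence takes priority over a loose match that starts earlier.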