I've been sitting on this problem for several hours now and I really don't know anymore... Essentially, I have an A|B|C - type separated regex and for whatever reason C matches over B, even though the individual regexes should be tested from left-to-right and stopped in a non-greedy fashion (i.e. once a match is found, the other regex' are not tested anymore).
This is my code:
text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"\
+ expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])
m = re_exp.search(text)
print(m.group(0))
I want regex to find the "expansion" string. In my dataset, sometimes the text has the expansion string slightly edited, for example having articles or prepositions like "for" or "the" between the main nouns. This is why I first try to just match the String as is, then try to match it if it is after any non-word character (i.e. parentheses or, like in the example above, a whole lot of stuff as the space was omitted) and finally, I just go full wild-card to find the string by search for the beginning and ending of the string with wildcards inbetween.
Either way, with the example above I would expect to get the followinging output:
American Heart Association
but what I'm getting is
American College of Cardiology (ACC)/American Heart Association
which is the match for the final regex.
If I delete the final regex or just call re.findall(r"(?<=\W)"+ expansion, text)
, I get the output I want, meaning the regex is in fact matching properly.
What gives?