regex findall overlapped does not give match if one of them is a prefix of the other

Question

import regex

product_detail = "yyy target1 target2 xxx".lower()
p1 = r"\btarget1\b|\btarget1 target2\b"
p2 = r"\btarget2\b|\btarget1 target2\b"
for pattern in [p1, p2]:
    matches = regex.findall(pattern, product_detail, overlapped=True)
    print(matches)

Why does p1 return only ['target1'], without 'target1 target2',

while p2 correctly returns ['target1 target2', 'target2']?

Also, if you can provide a fix, how do I generalise it? I have a list of 10,000 target words, so hardcoding them is not feasible.

Try placing the longer string first `r"\btarget1 target2\b|\btarget1\b"` — Alain T., Feb 25 '23 at 05:59
@AlainT. I tried that; it gives only the first occurrence, but I want both targets. — leonardltk1, Feb 25 '23 at 06:01
What do you mean by, "it gives the first occurrence"? @AlainT. is correct. Using `p1` the regex engine begins by attempting to match the string beginning with the first `y`. It first tries to match `\btarget1\b`. That fails, so it tries to match the second part of the alternation, `\btarget1 target2\b`, which also fails. The string pointer is then moved to the second `y` and the same attempt is made to match the regex. Both parts of the regex fail again so the string pointer is moved to the third `y`... — Cary Swoveland, Feb 25 '23 at 06:27
I see, they are not actually "overlapping" in the sense that regex understands it, because the pattern only ever counts as one match at a given position (i.e., variations in match length are not considered overlaps). You'll probably have to split the common prefixes into separate patterns and do multiple findalls. — Alain T., Feb 25 '23 at 06:30
...The pointer is now moved to the space following the third `y` and the attempt to match fails again, so the pointer is moved to the `t`. A match is then made with the first part of the alternation, `\btarget1\b`. The pointer is then moved to the space following `target1` and the process continues. No match is made at the space, so the pointer is moved to the `t` of `target2`. That matches neither `\btarget1\b` nor `\btarget1 target2\b`, so the pointer is moved to the `a`. Clearly, there will be no more matches in the string... — Cary Swoveland, Feb 25 '23 at 06:35
...By contrast, if you adopt @AlainT.'s suggestion of using the regex `\btarget1 target2\b|\btarget1\b`, the first `t` will match `\btarget1 target2\b` and you are finished. Now if that's not what you want you need to clarify the question (by editing it). Presumably you do not want to match `\btarget1\b|\btarget2\b`, which would give you two matches. — Cary Swoveland, Feb 25 '23 at 06:40
If you only want to match `"target1"` and `"target2"` if and only if `r"\btarget1 target2\b"` is matched, you can use the regex `r"\btarget1(?= target2\b)|(?<=\btarget1 )target2\b"`. [Demo](https://regex101.com/r/Kk7b60/1). As noted at the link, `(?= target2\b)` is a *positive lookahead* and `(?<=\btarget1 )` is a *positive lookbehind*. I don't know how that will work with thousands of "target" words, as I do not fully understand your question. — Cary Swoveland, Feb 25 '23 at 07:10
The question is quite straightforward: for all keywords in `keyword_lst`, I want to find them in `text`, regardless of whether these keywords are a prefix/suffix/substring of each other, and I am asking why `p1` does not give me that. — leonardltk1, Feb 25 '23 at 08:25
Related to @CarySwoveland's comment... Capturing [overlapping matches](https://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp) needs to be done inside a lookahead. With something like [`\b(?=((?:target[12] )?target[12])\b)`](https://regex101.com/r/KaIr5S/1), the captures will be found in the *first group*. — bobble bubble, Feb 25 '23 at 08:36
leonardltk1, do you mean you wish to determine which words in `keyword_lst` appear in the text before or after another word in `keyword_lst`? As to why `p1` does not give you that, have I answered that in my comments above? — Cary Swoveland, Feb 25 '23 at 19:42
@CarySwoveland No; for every keyword in keyword_lst, I want all of its occurrences in product_detail reported. Of course I can iterate over them one by one, but I'm going for efficiency here, so I was experimenting to see whether regex.findall does any optimisation internally. But yes, I think you have answered it above: if some keywords are a prefix of others, they can't all be flagged. — leonardltk1, Feb 27 '23 at 11:11
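
The behaviour described in the comments above can be reproduced with a small sketch. It uses the standard-library `re` module instead of `regex`, and a hypothetical helper `matches_per_start` (not part of either library) that tries the pattern at every start index, which is roughly how an overlapped search advances. The point it illustrates is that an alternation yields at most one match per start position, and the leftmost alternative that succeeds wins.

```python
import re

text = "yyy target1 target2 xxx"
p1 = re.compile(r"\btarget1\b|\btarget1 target2\b")
p2 = re.compile(r"\btarget2\b|\btarget1 target2\b")


def matches_per_start(pattern, text):
    """Hypothetical helper: try the pattern at every start index.

    At most one match is recorded per index, because the engine stops at
    the first alternative that succeeds there.
    """
    found = []
    for start in range(len(text)):
        m = pattern.match(text, start)  # anchored at `start`, full string context kept
        if m:
            found.append((start, m.group()))
    return found


print(matches_per_start(p1, text))  # [(4, 'target1')]
print(matches_per_start(p2, text))  # [(4, 'target1 target2'), (12, 'target2')]
```

With `p1`, both alternatives can only start at index 4, and `\btarget1\b` succeeds first, so the longer keyword is never reported. With `p2`, the two keywords start at different indices (4 and 12), so both appear.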

Alain T. · Accepted Answer · 2023-02-25T07:24:46.590

Here is an example of what I had in mind with my comment on building a list of patterns separating common prefixes:

import regex  # I'm actually using re (don't have regex)

product_detail = "yyy target1 target2 xxx".lower()

keywords = ["target1","target2","target1 target2","target3"]

from itertools import accumulate, groupby, zip_longest

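# sorting places every keyword directly after its prefixes
keywords.sort()
# for each keyword, the root of its prefix group (first keyword it starts with)
groups   = accumulate(keywords,lambda g,k:g if k.startswith(g) else k)
# collect the keywords sharing a root into one tuple per group
patterns = ( g for _,(*g,) in groupby(keywords,lambda _:next(groups)) )
# take the i-th keyword of every group, so no pattern holds a keyword and its prefix
patterns = ( filter(None,g) for g in zip_longest(*patterns) )
patterns = [r"\b" + r"\b|\b".join(g) + r"\b" for g in patterns]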

# [r'\btarget1\b|\btarget2\b|\btarget3\b', r'\btarget1 target2\b']

for pattern in patterns:
    matches = regex.findall(pattern, product_detail)
    print(matches)

output:

['target1', 'target2']
['target1 target2']
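
For a large keyword list, the same idea could be wrapped in reusable functions. The following is only a sketch under some assumptions: it uses the standard-library `re` module, the names `build_patterns` and `find_keywords` are hypothetical, keywords are passed through `re.escape`, and each pattern is wrapped in the capturing lookahead mentioned in the comments so that keywords starting at different positions may overlap (for example when one keyword is a suffix of another). It also assumes keywords begin and end with word characters, since it relies on `\b`.

```python
import re
from itertools import accumulate, groupby, zip_longest


def build_patterns(keywords):
    """Split keywords into levels so no level holds a keyword and its prefix."""
    ordered = sorted(set(keywords))
    # Root of each prefix group: the first keyword that the following ones start with.
    roots = accumulate(ordered, lambda root, kw: root if kw.startswith(root) else kw)
    groups = [list(g) for _, g in groupby(ordered, lambda _: next(roots))]
    patterns = []
    for level in zip_longest(*groups):
        alternatives = "|".join(re.escape(kw) for kw in level if kw is not None)
        # Capturing inside a lookahead consumes nothing, so matches that
        # start at different positions can overlap.
        patterns.append(re.compile(r"\b(?=((?:" + alternatives + r")\b))"))
    return patterns


def find_keywords(text, keywords):
    """Return every keyword occurrence, one pass over the text per level."""
    found = []
    for pattern in build_patterns(keywords):
        found.extend(m.group(1) for m in pattern.finditer(text))
    return found


text = "yyy target1 target2 xxx"
keywords = ["target1", "target2", "target1 target2", "target3"]
print(find_keywords(text, keywords))
# ['target1', 'target2', 'target1 target2']
```

The number of passes equals the size of the largest prefix group rather than the number of keywords, which is the property the answer's grouping is after. Whether this is fast enough for 10,000 keywords would need to be measured.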
