I have a regex job that searches for a pattern of the form
(This) <some words> (is a/was a) <some words> (1-4 digit number) <some words> (word)
where <some words> can be any number of words or characters, including spaces.
I used the following regex to achieve this:
(^|\W)This(?=\W).*?(?<=\W)(is a|was a)(?=\W).*?(?<=\W)(\d{1,4})((?=\W).*?(?<=\W))*(word)(?=\W)
I also have another constraint: the total length of the match should be less than 30 characters. Currently, my search matches strings of any length. Is there a way to enforce this constraint within the regex string itself?
I am currently enforcing it by checking the length of each matched object afterwards. However, the regex also matches strings that are longer than the limit, and those long matches cause me to miss shorter matches that do satisfy the length constraint.
For example, the string:
"hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish."
has 2 candidate matches:
- "This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word"
- "this is a 12 word"
My search captures the first one and misses the second, even though the second one meets my length criterion.
If the first match were within the length constraint, I could ignore the second match.
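A minimal sketch that reproduces the behaviour described above. The re.IGNORECASE flag is an assumption (without it the lowercase "this" could never start a match), and the names PATTERN, text and matches are only illustrative.

```python
import re

# Pattern from the question. re.IGNORECASE is an assumption so that
# the lowercase "this" is also a possible starting point.
PATTERN = re.compile(
    r"(^|\W)This(?=\W).*?(?<=\W)(is a|was a)(?=\W).*?(?<=\W)(\d{1,4})"
    r"((?=\W).*?(?<=\W))*(word)(?=\W)",
    re.IGNORECASE,
)

text = ("hi This is a alpha, bravo Charley, delta, echo, fox, golf, "
        "this is a 12 word finish.")

# finditer() resumes scanning after the end of each match, so a shorter
# candidate that lies inside a longer match is never tried.
matches = [m.group(0) for m in PATTERN.finditer(text)]

for s in matches:
    print(len(s), repr(s))
```

Only one match is reported here, and it is the long one that starts at the first "This". Note that the matched text also includes the delimiter consumed by (^|\W).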
I am using re.sub() to replace those strings, with a repl function passed to sub() that checks the length. My dataset is large, so the search takes a long time. The most important thing for me is to do the search efficiently with the length constraint built in, so as to avoid these incorrect matches.
I am using Python 3.
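A rough sketch of this kind of length check, not my exact code. MAX_LEN, REPLACEMENT and the function name repl are hypothetical, and re.IGNORECASE is again an assumption.

```python
import re

MAX_LEN = 30          # length limit from the question
REPLACEMENT = "<X>"   # hypothetical replacement text

PATTERN = re.compile(
    r"(^|\W)This(?=\W).*?(?<=\W)(is a|was a)(?=\W).*?(?<=\W)(\d{1,4})"
    r"((?=\W).*?(?<=\W))*(word)(?=\W)",
    re.IGNORECASE,  # assumption, see above
)

def repl(match):
    """Replace the match only if it is shorter than MAX_LEN."""
    s = match.group(0)
    # A match that is too long is put back unchanged. The text it covered
    # has already been consumed, so nothing inside it is searched again.
    return REPLACEMENT if len(s) < MAX_LEN else s

long_text = ("hi This is a alpha, bravo Charley, delta, echo, fox, golf, "
             "this is a 12 word finish.")
short_text = "hi this is a 12 word finish."

long_result = PATTERN.sub(repl, long_text)
short_result = PATTERN.sub(repl, short_text)

print(long_result)
print(short_result)
```

With the long input the string comes back unchanged, because the only match exceeds the limit and the shorter candidate inside it is never seen. With the short input the match is replaced as intended.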
Thanks in advance