regex: finding a match that satisfies a specific length constraint

Question

I have a regex job to search for a pattern

(This) <some words> (is a/was a) <some words> (0-4 digit number) <some words> (word)

where <some words> can be any number of words/characters, including spaces. I used the following to achieve this:

(^|\W)This(?=\W).*?(?<=\W)(is a|was a)(?=\W).*?(?<=\W)(\d{1,4})((?=\W).*?(?<=\W))*(word)(?=\W)

I also have another constraint: the total length of the match should be less than 30 characters. Currently, my search matches at any length and finds all sets of words. Is there an option in regex that I can use to enforce this constraint in the regex string itself?

I am currently handling this by checking the length of the matched regex objects. However, I have to deal with strings that are longer than the required length, and this causes some matches that are under the length constraint to be missed.

For example, the string:

"hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish."

has 2 matches:

"This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word"
"this is a 12 word"

My search captures the first one and misses the second, but the second one meets my length criterion.

If the first match is shorter than the length constraint, then I can ignore the second match.

I am using re.sub() to replace those strings, with a repl function inside sub() that checks the length. My dataset is large, so the search takes a lot of time. The most important thing to me is to do the search efficiently, including the length constraint, so as to avoid these incorrect matches.

I am using Python 3.

Thanks in advance

Afaik the only way to [get overlapping matches](https://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp) is by [capturing inside a lookahead](https://regex101.com/r/XydR0X/1). — bobble bubble, Jul 05 '21 at 21:21
Does this answer your question? [Python regex find all overlapping matches?](https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches) — Ryszard Czech, Jul 05 '21 at 21:44
Thanks for the reply. I am trying to do a regex sub on a large set of data and am not sure how to do that. If possible, **I want to know how I can include the length constraint as part of the search itself, so that the search is efficient**. If the double matching happens inside a string of valid length, then I can ignore it. I will update the question with these details. — sachin mathew jose, Jul 06 '21 at 16:44

score 1 · Answer 1 · answered Jul 06 '21 at 18:15

The regex engine doesn't provide a method to do exactly what you're asking for; you'd need to use regex in conjunction with another tool to get the result you want.

Building on some of the comments on your question, the following regex will return the entire match (everything from 'This' through 'word'):

\b(?=([Tt]his\b.+?\b(?:i|wa)s a\b.+?\b\d{1,4}\b.+?\bword))\b

You can then filter the results to only produce the output you're looking for.

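import re

string = 'hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish.'
# The capture group sits inside a lookahead, so overlapping candidates are all found.
pat = re.compile(r'\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b')

# Keep only the captured candidates shorter than 30 characters.
# returns ['this is a 12 word']
[x[1] for x in pat.finditer(string) if len(x[1]) < 30]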
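
Since the question is ultimately about replacing the matches, the same filtering idea can be carried over to a substitution. The following is only a minimal sketch: `sub_short_matches`, its `max_len` parameter and the `'<MATCH>'` placeholder are hypothetical names introduced for illustration, not part of the `re` module. It reuses the lookahead pattern from above, reads the span of the capture group, and rebuilds the string by hand, because `re.sub()` would only replace the zero-width match and not the captured text.

```python
import re

# Same pattern as above: the capture group is inside a lookahead,
# so the scan can see candidates that overlap each other.
PAT = re.compile(r'\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b')


def sub_short_matches(text, repl, max_len=30):
    """Replace each captured candidate shorter than max_len with repl.

    Hypothetical helper for illustration; repl is a plain string here.
    """
    pieces = []
    pos = 0  # end of the most recently replaced span
    for m in PAT.finditer(text):
        start, end = m.span(1)  # span of the text captured inside the lookahead
        # Skip candidates that are too long or that overlap an earlier replacement.
        if end - start >= max_len or start < pos:
            continue
        pieces.append(text[pos:start])
        pieces.append(repl)
        pos = end
    pieces.append(text[pos:])
    return ''.join(pieces)


text = 'hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish.'
result = sub_short_matches(text, '<MATCH>')
print(result)
```

Candidates are considered from left to right, so when an earlier candidate is short enough it is replaced and any later candidate overlapping it is skipped, which matches the "ignore the second match" rule in the question. Note that the lazy `.*?` can still scan far ahead from every 'This', so this sketch addresses correctness rather than speed.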
