I have a regex that is meant to do the following:
- Find a word that has two or more adjacent capital letters A-Z ("multi caps word");
- When possible, extend the match to the left and to the right up to another multi caps word, as long as there are no more than three non-multi caps words between each multi caps word; and
- Extend the match to the left to include a sequence of 5 non-multi caps words, and to the right to include a sequence of 3 non-multi caps words.
My regex catches the desired pattern but returns a variety of overlapping matches when there are adjacent multi caps words, like AA BB DD below. Please help me tweak my regex to work as desired.
Here is my draft code:
import re

str1 = 'z z z z z11 AA BB DD f f d e gd df sdf ggf we AA ff d f f'
re.findall(r'(?=(\s(?:[^\s]+[\s]+){5}(?:[^A-Z\s]*[A-Z][A-Z]+(?:[^\s]+[\s]+){1,3}?)*?[^A-Z\s]*[A-Z][A-Z]+.*?(?:[\s]+[^\s]+){3}\s))', str1)
Actual output:
Match 1 - 'z z z z z11 AA BB DD f'
Match 2 - 'z z z z11 AA BB DD f f'
Match 3 - 'z z z11 AA BB DD f f d'
Match 4 - 'gd df sdf ggf we AA ff d f'
Desired output:
Match 1 - 'z z z z z11 AA BB DD f f d'
Match 2 - 'gd df sdf ggf we AA ff d f'
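For reference, here is a minimal sketch of one direction that might avoid the overlaps: consume the text in the match itself instead of capturing inside a lookahead, so that each search resumes after the previous match. The names MULTI, PLAIN, PATTERN and find_spans are hypothetical helpers, not part of my draft. It also assumes that "up to" 5 and 3 surrounding words is acceptable (the draft requires exactly 5 and 3) and that words are separated by whitespace.

```python
import re

# Hypothetical building blocks; names are illustrative only.
MULTI = r'\S*[A-Z]{2}\S*'        # a word containing two adjacent capitals A-Z
PLAIN = r'(?!\S*[A-Z]{2})\S+'    # a word with no two adjacent capitals

PATTERN = re.compile(
    rf'(?<!\S)'                                # start at the beginning of a word
    rf'(?:{PLAIN}\s+){{0,5}}'                  # assumption: up to 5 plain words on the left
    rf'{MULTI}'                                # the first multi caps word
    rf'(?:\s+(?:{PLAIN}\s+){{0,3}}{MULTI})*'   # more multi caps words, at most 3 plain words apart
    rf'(?:\s+{PLAIN}){{0,3}}'                  # assumption: up to 3 plain words on the right
    rf'(?!\S)'                                 # stop at the end of a word
)

def find_spans(text):
    """Return non-overlapping matches, scanning left to right."""
    return [m.group(0) for m in PATTERN.finditer(text)]

str1 = 'z z z z z11 AA BB DD f f d e gd df sdf ggf we AA ff d f f'
for i, span in enumerate(find_spans(str1), start=1):
    print(f'Match {i} - {span!r}')
```

Because the match consumes its text, words used as right-hand context for one match are not available as left-hand context for the next one. That may or may not be the behaviour wanted when two groups of multi caps words sit close together.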