
I have a regex that does the following:

  1. Find a word that has two or more adjacent capital letters A-Z ("multi caps word");
  2. When possible, extend the match to the left and to the right up to another multi caps word, as long as there are no more than three non-multi caps words between each multi caps word; and
  3. Extend the match to the left by a sequence of 5 non-multi caps words and to the right by a sequence of 3 non-multi caps words.

My regex catches the desired pattern but returns a variety of overlapping matches when there are adjacent multi caps words, like AA BB DD below. Please help me tweak my regex to work as desired.

Here is my draft code:

import re

str1 = 'z z z z z11 AA BB DD f f d e gd df sdf ggf we AA ff d f f'
re.findall(r'(?=(\s(?:[^\s]+[\s]+){5}(?:[^A-Z\s]*[A-Z][A-Z]+(?:[^\s]+[\s]+){1,3}?)*?[^A-Z\s]*[A-Z][A-Z]+.*?(?:[\s]+[^\s]+){3}\s))', str1)

Actual Output:

Match 1 - 'z z z z z11 AA BB DD f'
Match 2 - 'z z z z11 AA BB DD f f'
Match 3 - 'z z z11 AA BB DD f f d'
Match 4 - 'gd df sdf ggf we AA ff d f'

Desired output:

Match 1 - 'z z z z z11 AA BB DD f f d'
Match 2 - 'gd df sdf ggf we AA ff d f'
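
For context, here is a minimal sketch of why wrapping a pattern in a capturing lookahead, as the draft regex does with `(?=(...))`, tends to produce overlapping results. The two-character pattern and the sample string are illustrative assumptions, not part of the original code. A lookahead consumes no characters, so `re.findall` retries at every position; a plain pattern consumes what it matches and resumes after it.

```python
import re

text = 'abcd'  # illustrative sample string, not from the original post

# Zero-width lookahead: nothing is consumed, so every start position is tried
# and the captured groups can overlap.
overlapping = re.findall(r'(?=(\w\w))', text)

# Consuming pattern: the next search starts where the previous match ended.
consuming = re.findall(r'\w\w', text)

print(overlapping)  # ['ab', 'bc', 'cd']
print(consuming)    # ['ab', 'cd']
```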
falsetru
user2104778

1 Answer


Try this:

>>> pattern = r'(?:[a-z\d]+\s*){0,5}(?:[A-Z]+)(?:\s*[A-Z]+)*(?:\s*[a-z]+){0,3}'
>>> re.findall(pattern, str1)
['z z z z z11 AA BB DD f f d', 'gd df sdf ggf we AA ff d f']
falsetru
  • I appreciate the effort but the answer missed several of the desired goals of my regex. I want multi-caps words for my point 1, not single caps words. Also, I tried the answer on some other data and it broke down. What I really want is a tweak of my existing regex. My existing regex has an enormous amount of thought already in it. – user2104778 Jul 27 '14 at 04:02
  • @user2104778, The second part of my answer was invalid, I deleted it. Did you try the first one? – falsetru Jul 27 '14 at 04:05
  • Yes, I tried it as well, and it broke down where there were digits in words, or punctuation. I may see if I can tweak your answer later, but as it stands it doesn't do the job. Don't get me wrong though, I appreciate your attempt. – user2104778 Jul 27 '14 at 04:07
  • @user2104778, How about this? http://ideone.com/79C6lB (This is basically yours simplified, without lookahead assertion) – falsetru Jul 27 '14 at 04:26
  • Close enough. I realized what the problem was. My regex pattern for non multi caps words caught multi caps words. I revamped the definition and it worked. – user2104778 Jul 29 '14 at 16:26
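
The revamped regex itself was not posted, so the following is only a sketch of one way the fix described in the last comment could look, not the asker's actual pattern. It assumes a "multi caps word" is any whitespace-delimited token containing two adjacent capitals, and it uses a negative lookahead so that the non-multi caps definition can no longer match such a token. The names `MULTI`, `PLAIN`, `PATTERN`, and `find_contexts` are hypothetical. It also accepts up to 5 leading and up to 3 trailing words rather than exact counts, and drops the outer lookahead so matches do not overlap.

```python
import re

# Assumption: a multi caps word is a run of non-whitespace with two adjacent capitals.
MULTI = r'\S*[A-Z]{2}\S*'
# A non-multi caps word: the lookahead rejects any token containing two adjacent capitals.
PLAIN = r'(?!\S*[A-Z]{2})\S+'

PATTERN = re.compile(''.join([
    r'(?<!\S)',                                          # start at a word boundary
    r'(?:' + PLAIN + r'\s+){0,5}',                       # up to 5 plain words on the left
    MULTI,                                               # first multi caps word
    r'(?:\s+(?:' + PLAIN + r'\s+){0,3}' + MULTI + r')*', # further multi caps words, at most 3 plain words apart
    r'(?:\s+' + PLAIN + r'){0,3}',                       # up to 3 plain words on the right
]))


def find_contexts(text):
    """Return non-overlapping multi caps matches with their surrounding words."""
    return PATTERN.findall(text)


str1 = 'z z z z z11 AA BB DD f f d e gd df sdf ggf we AA ff d f f'
print(find_contexts(str1))
# ['z z z z z11 AA BB DD f f d', 'gd df sdf ggf we AA ff d f']
```

Because `PLAIN` uses `\S+` rather than `[a-z]+`, words containing digits or punctuation are still treated as ordinary words, which is the case the earlier comments reported as breaking.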