Capture all consecutive all-caps words with regex in Python?

Question

I am trying to match all consecutive all caps words/phrases using regex in Python. Given the following:

    text = "The following words are ALL CAPS. The following word is in CAPS."

The code should return:

    ALL CAPS, CAPS

I am currently using:

    matches = re.findall('[A-Z\s]+', text, re.DOTALL)

But this returns:

    ['T', ' ', ' ', ' ', ' ALL CAPS', ' T', ' ', ' ', ' ', ' ', ' CAPS']

I clearly don't want the punctuation or the 'T'. I want to return only single words, or runs of consecutive words, that consist entirely of capital letters.

Thanks

What do you expect when the words aren't separated by a space like `ABC.DEF`? — Casimir et Hippolyte, Apr 20 '17 at 15:03
Why do you use the option `re.DOTALL` since there is no dot in your pattern? — Casimir et Hippolyte, Apr 20 '17 at 15:05
It was just copied in from another command. It doesn't change the output though. Very new to regex, so certainly not doing this right. — BHudson, Apr 20 '17 at 15:06
I'm confused, your question asks for `consecutive all-caps words` but your example of results you want indicates you're just looking for **any** all-caps words. — Kind Stranger, Apr 20 '17 at 15:21
I should have been clearer: I want to capture all words in all caps, but if they are consecutive, I want them returned as a phrase, not as individual words. — BHudson, Apr 20 '17 at 16:40

Toto · Accepted Answer · 2017-04-21T17:11:35.317

4

This one does the job:

    import re

    text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
    # Each word needs at least one capital and may contain at most one lowercase letter.
    matches = re.findall(r"(\b(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)", text)
    print(matches)

Output:

['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']

Explanation:

(           : start group 1
  \b        : word boundary
  (?:       : start non capture group
    [A-Z]+  : 1 or more capitals
    [a-z]?  : 0 or 1 small letter
    [A-Z]*  : 0 or more capitals
   |        : OR
    [A-Z]*  : 0 or more capitals
    [a-z]?  : 0 or 1 small letter
    [A-Z]+  : 1 or more capitals
  )         : end group
  \b        : word boundary
  (?:       : non capture group
    \s+     : 1 or more whitespace characters
    (?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+) : same as above
    \b      : word boundary
  )*        : 0 or more times the non capture group
)           : end group 1
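
The breakdown above can also be written into the pattern itself. What follows is a minimal sketch, not part of the original answer: the names `WORD`, `PHRASE` and `caps_phrases` are made up for illustration, and it assumes Python 3.6+ for the f-string. It uses `re.VERBOSE` so the comments live inside the regex, and it tries the 'JOhN SMITh' style of input mentioned in the comments below.

```python
import re

# One word: at least one capital, at most one lowercase letter (hypothetical name).
WORD = r"(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)"

# re.VERBOSE ignores whitespace in the pattern and allows inline comments.
PHRASE = re.compile(
    rf"""
    \b{WORD}\b          # first word
    (?:\s+{WORD}\b)*    # further words, each preceded by whitespace
    """,
    re.VERBOSE,
)

def caps_phrases(text):
    """Return every run of consecutive (nearly) all-caps words in text."""
    return [m.group(0) for m in PHRASE.finditer(text)]

print(caps_phrases("Signed by JOhN SMITh and jOHN SMiTH, not by John Smith."))
# ['JOhN SMITh', 'jOHN SMiTH']
```

Because one lowercase letter is allowed, two-letter words such as 'Is' also match, as the output above shows.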

edited Apr 21 '17 at 17:11

answered Apr 20 '17 at 15:20

Toto

Thanks! Accepted the answer. This works perfectly. Any chance you could help with adding a flag to *allow* (not require) one lowercase character in the captured string? I think requiring one lowercase character would be something like '(?=.*[a-z])', but I want to get phrases/words like 'ALL CaPS' or 'CaPS' as well as 'ALL CAPS' and 'CAPS'. Thanks again – BHudson Apr 21 '17 at 11:21
@BHudson: Must the words begin with a capital? What about `The` at the beginning of the string? – Toto Apr 21 '17 at 11:31
They need not begin with a capital, but all but one character should be in caps. I have a huge document where names are in all caps, but many are misspelled. I am extracting all the names and will use fuzzy matching to correct them. Some incorrectly have a single lowercase letter. So, I need to match 'tHE' or 'ThE', but not 'The' (my likely use will be 'JOhN SMITh' or 'jOHN SMiTH'). Thanks – BHudson Apr 21 '17 at 11:36
Thanks! Perfect. Appreciate the explanation. – BHudson Apr 22 '17 at 11:03
@BHudson: You're welcome, glad it helps. Feel free to upvote ;-) – Toto Apr 22 '17 at 11:53
Best regex walk through I've seen yet. Well done @Toto. – alofgran Jan 20 '21 at 22:43

Dashadower · Answer 2 · 2017-04-20T15:20:02.010

1

Your regex relies on explicit conditions (a space after the letters).

    matches = re.findall(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])", text)

This captures repetitions of A to Z as long as the run does not end in a lowercase letter, a digit, or a non-alphabetic character.
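
Here is a quick check of this pattern, offered as a sketch rather than a correction. The helper name `find_caps` and the two extra test strings are assumptions added for illustration. The pattern needs at least three uppercase characters and allows at most one whitespace character, so shorter all-caps words are skipped and longer phrases are split.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])")

def find_caps(text):
    """Return the matches of the answer's pattern (hypothetical helper)."""
    return PATTERN.findall(text)

print(find_caps("The following words are ALL CAPS. The following word is in CAPS."))
# ['ALL CAPS', 'CAPS']
print(find_caps("It is IT."))       # two-letter word is too short
# []
print(find_caps("ONE TWO THREE"))   # only one \s? is allowed per match
# ['ONE TWO', 'THREE']
```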

edited Apr 20 '17 at 15:20

answered Apr 20 '17 at 15:07

Dashadower

OP says "ALL CAPS" should be 1 match group, and you missed the word A case – Tezra Apr 20 '17 at 15:11
@Tezra Edited to meet requirements. – Dashadower Apr 20 '17 at 15:20

score 1 · Answer 3 · answered Apr 20 '17 at 15:28

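Keeping your regex (with word boundaries added), you can use strip() and filter:

    import re

    string = "The following words are ALL CAPS. The following word is in CAPS."
    # Strip each match, then drop the empty strings left by whitespace-only matches.
    # list() is needed on Python 3, where filter() returns an iterator.
    result = list(filter(None, [x.strip() for x in re.findall(r"\b[A-Z\s]+\b", string)]))
    # ['ALL CAPS', 'CAPS']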

Pedro Lobito

score 0 · Answer 4 · answered Apr 20 '17 at 15:08

Assuming you want to start and end on a letter, and only include letters and whitespace:

    \b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b

The `|[A-Z]` alternative captures single-letter words such as `I` or `A`.
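
A minimal sketch of how this pattern might be used from Python follows. The names `CAPS_RUN` and `caps_runs` and the second test string are assumptions added for illustration. Since the pattern has a single capturing group, `re.findall` returns the text of that group for each match.

```python
import re

# Pattern from the answer above; findall returns the contents of its one group.
CAPS_RUN = re.compile(r"\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b")

def caps_runs(text):
    """Return all-caps words, with consecutive ones joined into one phrase."""
    return CAPS_RUN.findall(text)

print(caps_runs("The following words are ALL CAPS. The following word is in CAPS."))
# ['ALL CAPS', 'CAPS']
print(caps_runs("I think A PLAN works"))
# ['I', 'A PLAN']
```

Note that `\s` also matches newlines and repeated spaces, so a phrase may span a line break and keeps its internal whitespace as written.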

Tezra
