2

Looking for a way to extract only words that are in ALL CAPS from a text string. The catch is that it shouldn't extract other words in the text string that are mixed case.

For example, how do I use regex to extract KENTUCKY from the following sentence:

There Are Many Options in KENTUCKY

I'm trying to do this using regexextract() in Google Sheets, which uses RE2.

Looking forward to hearing your thoughts.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • By doing something like this `/\b([A-Z]{2,})\b/`. [Demo](https://regex101.com/r/gP5bO8/1). Is that right ? –  Jan 26 '16 at 17:04

4 Answers4

1

Pretending that your text is in cell A2:

If there is only one instance in each text segment this will work:

=REGEXEXTRACT(A2,"([A-Z]{2,})")

If there are multiple instances in a single text segment then use this, it will dynamically adjust the regex to extract every occurrance for you:

=REGEXEXTRACT(A2, REPT(".* ([A-Z]{2,})", COUNTA(SPLIT(REGEXREPLACE(A2,"([A-Z]{2,})","$"),"$"))-1))
Aurielle Perlmann
  • 5,323
  • 1
  • 15
  • 26
1

If you need to extract whole chunks of words in ALLCAPS, use

=REGEXEXTRACT(A2,"\b[A-Z]+(?:\s+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:\s+\p{Lu}+)*\b")

See this regex demo.

Details

  • \b - word boundary
  • [A-Z]+ - 1+ ASCII letters (\p{Lu} matches any Unicode letters inlcuding Arabic, etc.)
  • (?:\s+[A-Z]+)* - zero or more repetitions of
    • \s+ - 1+ whitespaces
    • [A-Z]+ - 1+ ASCII letters (\p{Lu} matches any Unicode letters inlcuding Arabic, etc.)
  • \b - word boundary.

Or, if you allow any punctuations or symbols between uppercase letters you may use

=REGEXEXTRACT(A2,"\b[A-Z]+(?:[^a-zA-Z0-9]+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:[^\p{L}\p{N}]+\p{Lu}+)*\b")

See the regex demo.

Here, [^a-zA-Z0-9]+ matches one or more chars other than ASCII letters and digits, and [^\p{L}\p{N}]+ matches any one or more chars other than any Unicode letters and digits.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks for the solution. Any way to make it work for any one or more uppercase word of 2 or more letters, with output in a single cell, and preferably in an array formula? Please let me know if you'd need me to ask a new question. The accepted answer outputs in split cells and is not ready made for arrayformula. I tested this but it returns only the 1st uppercase word in the cell: `=IFNA(ArrayFormula(REGEXEXTRACT(A1:A,"\b([A-Z]{2,})+(?:\s+[A-Z]+)*\b")),"")` – Lod Feb 14 '23 at 13:35
  • For example, I'd need to return `ONE TWO THREE` in `B1`, from `A1` content: `This cell's contains ONE paragraph, TWO commas, and THREE uppercase words.` – Lod Feb 14 '23 at 13:39
  • 1
    @Lod `REGEXEXTRACT` only returns a single (first) match, it is by design. You need an extra script here, I believe. – Wiktor Stribiżew Feb 14 '23 at 13:51
0

This should work:

\b[A-Z]+\b

See demo

Aferrercrafter
  • 319
  • 1
  • 6
  • 14
0

2nd EDIT ALL CAPS / UPPERCASE solution:

Finally got this simpler way from great other helping solutions here and here:

=trim(regexreplace(regexreplace(C15,"(?:([A-Z]{2,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))

From this input:

isn'ter JOHN isn'tar DOE isn'ta or JANE

It returns this output:

JOHN DOE JANE

The Same For Title Case (Extracting All Capitalized / With 1st Letter As Uppercase Words :

Formula:

=trim(regexreplace(regexreplace(C1,"(?:([A-Z]([a-z]){1,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))

Input in C1:

The friendly Quick Brown Fox from the woods Jumps Over the Lazy Dog from the farm.

Output in A1:

The Quick Brown Fox Jumps Over Lazy Dog

Previous less efficient trials :

I had to custom tailor it that way for my use case:

= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(N3: N,
    "(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
    ")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(@)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
    ""), "[A-Z]{2,}") = FALSE, "", REGEXREPLACE(N3: N,
    "(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
    ")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(@)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
    "")))

Going one by one over all exceptions and adding their respective regex formulations to the front of the multiple pipes separated regexes in the regexextract function.

@Wiktor Stribiżew any simplifying suggestions would be very welcome.

Found some missing and fixed them.

1st EDIT:

A simpler version though still quite lengthy:

= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
            REGEXREPLACE(REGEXREPLACE(P3: P, "[a-z,]",
                " "), "-|\.", " "), "(^[A-Z]\s)", " "
            ), "(\s[A-Z]\s)", " "),
    "\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
    "|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|@|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
    " "), "[A-Z]{2,}") = FALSE, " ", REGEXREPLACE(
    REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
            P3: P, "[a-z,]", " "), "-|\.", " "),
        "(^[A-Z]\s)", " "), "(\s[A-Z]\s)", " "),
    "\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
    "|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|@|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
    " ")))

From this example:

Multiple regex matches in Google Sheets formula

Lod
  • 657
  • 1
  • 9
  • 30