2

I am using a very extensive formula to remove all words from a string that start with lowercase letters and contains numerous special characters and other signs and symbols. The goal is to end up with only words that start with uppercase letters (if two words start with uppercase letters right after each other, then they are counted as one word). For example

This is the input

Obama Mama told: Reporters in 19. Washington-Post in That he and Netanyahu-lll are opposed to Iran's calls for "Death To America".

And this is the expected output

Obama Mama, Reporters, Washington-Post, That, Netanyahu-lll, Irans, Death To America

And this is the formula

=Regexreplace(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(INDEX(TEXTJOIN(" "; 1; LAMBDA(x;IF(REGEXMATCH(x&"";"(^[0-9a-zäüö])");"_";x))(SPLIT(G7;" "&CHAR(10)))));"(.*)\/|\|.*|\(.*\) |\.|»| - .*$| – |!|\?|\+|\„|\“|%| \& | \& |'|»|«|""";"");"(:| --)";" _");"(^[_\s]+|[\s_]+$)";"");"\s_+";",");"([,]+|,\s)";",")

This also works very well. I just wonder if there isn't a simpler way that just extracts all the words with capital letters.

Is there such a thing as a multiple REGEXEXTRACT? This one is only extracting the first word:

=regexextract(G7;"\b[A-Z].*?\b")
TheMaster
  • 45,448
  • 6
  • 62
  • 85
user3392296
  • 644
  • 4
  • 16
  • 1
    How about `=REGEXREPLACE(REGEXREPLACE(A1;"[^A-ZÄÖÜ]*((?:\s*[A-ZÄÖÜ][a-zäöüß'-]*)+)[^A-ZÄÖÜ]*";"$1,");"[ ,]+$";"")` → [see Regex101 demo](https://regex101.com/r/mSsZsb/1) – bobble bubble Oct 29 '22 at 12:04
  • Single regex, Close but no cigar: `=REGEXREPLACE(A1,"((\s|^)[a-z0-9][\w']+[.,:'""]*)+|([.,:""]+)",",")` – TheMaster Oct 29 '22 at 13:26

4 Answers4

3

Matching your output exactly, I can get it down to three regex. Try this:

=regexreplace(regexreplace(regexreplace(G7,"[^A-Za-z\'\-\ ""]",""),"\ [a-z\ ]+\ ",", "),"[\'""]","")

If you need the output in separate cells, use SPLIT() around the regex.

enter image description here

Aresvik
  • 4,484
  • 1
  • 5
  • 18
3

I wonder if it's possible with only one REGEXREPLACE but I doubt it. Another idea with two replaces:

=REGEXREPLACE(REGEXREPLACE(A1;
 "[^A-ZÄÖÜ]*((?:\s*[A-ZÄÖÜ][a-zäöüß'-]*)+)[^A-ZÄÖÜ]*";"$1,");"[ ,]+$";"")

enter image description here

See this demo at regex101 - What this pattern does, is to capture the wanted parts (also taking umlauts into account) and replace unwanted with what was captured to the first group $1 and , after it. The second (outer) replacement is just used to trim the end from added comma and space.

bobble bubble
  • 16,888
  • 3
  • 27
  • 46
  • Thanks a lot! Also, is there an option to add exceptions to the lowercase words that should not be removed? For example, all words in the range of A3:A – user3392296 Nov 02 '22 at 11:20
  • 1
    @user3392296 Youre welcome! You could add exceptions manually to the capture group e.g. like [in this demo](https://regex101.com/r/i6NQ3m/1). Not sure if that's what you meant else I can't help much with it, not doing much in Google Sheets. – bobble bubble Nov 02 '22 at 12:00
2

Is there such a thing as a multiple REGEXEXTRACT?

If you provide multiple capture groups(()), you'll get multiple extracts.

We can adapt the technique mentioned by @MaxMakhrov here, where the original string is converted to a regex with capture groups and passed as a regex argument to REGEXEXTRACT.

="\Q"&REGEXREPLACE(A1,"(\b[A-Z][a-z']+(\s|-\w+|\b))+","\\E(.*)\\Q")&"\E"
  • \Q...\E QuotE. Anything between this is not treated as a regex meta character.
  • \b word border
  • [A-Z][a-z']+ uppercase letter followed by one or many lowercase letter or a '(This ' is to include ' in Iran's)
  • (\s|-\w+|\b) Ending with a space or a -(dash) and more word characters or word border.
  • The whole regex can be repeated multiple times((...)+). This is to satisfy if two words start with uppercase letters right after each other, then they are counted as one word

For the input,

Obama Mama told: Reporters in 19. Washington-Post in That he and Netanyahu-lll are opposed to Iran's calls for "Death To America".

This should become

"\Q\E(.*)\Qtold: \E(.*)\Qin 19. \E(.*)\Q in \E(.*)\Qhe and \E(.*)\Q are opposed to \E(.*)\Qcalls for ""\E(.*)\Q"".\E"

If we pass to REGEXEXTRACT, we can get all the uppercase words. Then we JOIN them.

=ARRAYFORMULA(JOIN(", ",TRIM(REGEXEXTRACT(A1,"\Q"&REGEXREPLACE(A1,"(\b[A-Z][a-z']+(\s|-\w+|\b))+","\\E(.*)\\Q")&"\E"))))

Output:

Obama Mama, Reporters, Washington-Post, That, Netanyahu-lll, Iran's, Death To America

Note: Apostrophe ' in Iran's is present, as it is unclear whether you want to remove it.

TheMaster
  • 45,448
  • 6
  • 62
  • 85
-2

try:

=INDEX(TEXTJOIN(", ", 1, LAMBDA(x, 
 IFERROR(IF(REGEXMATCH(x, "^[a-z]"),,x)))
 (SPLIT(REGEXREPLACE(A1, "[""\:\.,'!\?\(\)]", ), " "))))

enter image description here

player0
  • 124,011
  • 12
  • 67
  • 124