How To Extract Uppercase First Letter Multiple Words Per Cells Only Ideally Ignoring First Words of Sentences With Google Sheets REGEXEXTRACT Formula?

Question

I'm trying to extract all words with Uppercase initial letter from a text, with the REGEXEXTRACT formula in google sheets.

Ideally the first word of sentences should be ignored and only all subsequent words with first Uppercase letter should be extracted.

The Regex Pattern I Use:

I also found this regex [A-ZÖ][a-zö]+ pattern that works well to get all the Uppercase first letter words.

The problem is that it's not ignoring the first words of sentences.

Other Python Solution Vs Google Sheets Formula:

I've also found this python tutorial and script to do it:

Proper Noun Extraction in Python using NLP in Python

# Importing the required libraries
import nltk 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, sent_tokenize

# Function to extract the proper nouns 

def ProperNounExtractor(text):
    
    print('PROPER NOUNS EXTRACTED :')
    
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        words = [word for word in words if word not in set(stopwords.words('english'))]
        tagged = nltk.pos_tag(words)
        for (word, tag) in tagged:
            if tag == 'NNP': # If the word is a proper noun
                print(word)

text =  """Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."""

# Calling the ProperNounExtractor function to extract all the proper nouns from the given text. 
ProperNounExtractor(text)

It works well, but I the idea in doing it in Google Sheets is to have the Uppercase first letter Words adjacent to the text in a table format for more convenient reference.

Question Summary:

How would you adjust my formula in the sample sheet below

=ARRAYFORMULA(IF(A1:A="","",REGEXEXTRACT(A1:A,"[A-ZÖ][a-zö]+")))

to add those functions:

Extract all the first Uppercase letter Words from each cell with text
Ignore the first words of sentences
return all the first Uppercase letter Words save the first words from sentences into the adjacent cells, one word per cell (similar to this example (from the 2nd Close question above): )

Sample Sheet:

Here's my testing Sample Sheet

Many thanks for your help!

Wiktor Stribiżew · Accepted Answer · 2022-01-05T10:55:26.260

You can use

=ARRAYFORMULA(SPLIT(REGEXREPLACE(REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+", CHAR(10)), "(?m)^\s*[[:upper:]][[:alpha:]]*|.*?([[:upper:]][[:alpha:]]*|$)", "$1" & char(10)), CHAR(10)))

Or, to make sure the ?, ! or . / ... that are matched as sentence boundaries are followed with an uppercase letter:

=ARRAYFORMULA(SPLIT(REGEXREPLACE(REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+([[:upper:]])", CHAR(10) & "$1"), "(?m)^\s*[[:upper:]][[:alpha:]]*|.*?([[:upper:]][[:alpha:]]*|$)", "$1" & char(10)), CHAR(10)))

See the demo screenshot:

See the regex demo.

First, we split the text into sentences in a cell with REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+", CHAR(10)). Actually, this just replaces final sentence punctuation with a newline.

The second REGEXREPLACE is used with another regex that matches

(?m)^\s*[[:upper:]][[:alpha:]]* - a capitalized word ([[:upper:]][[:alpha:]]*) at the start of string/line (^) together with optional whitespace (\s*)
| - or
.*? - any zero or more chars other than line break chars, as few as possible
([[:upper:]][[:alpha:]]*|$) - Group 1 ($1): an uppercase letter ([[:upper:]]) and then any zero or more letters ([[:alpha:]]*), or end of string ($)

and replaces the match with Group 1 value and a newline, LF char. Then, the result is SPLIT with a newline char.

Thanks a lot for the reply and detailed explanation Wiktor! It works very well! I'll study your method in more details. Be well! — Lod, Jan 05 '22 at 11:03
@Lod I see `[:upper:]` and `[:alpha:]` still do not support all Unicode letters, only ASCII, in Google sheets RE2 sub-flavor, so you will probably have to revert your `[A-ZÖ][a-zö]+` pattern. — Wiktor Stribiżew, Jan 05 '22 at 11:04
Oh thanks for the note, I had not thought about the non-english/latin text part of the problem (I was only planning to use it on regular english/latin based text, so your formula should be enough). I've tried to modify it with the previous `[A-ZÖ][a-zö]+` pattern but without success so far. I'll see later if I can make it work. Thanks again! — Lod, Jan 05 '22 at 11:32

JvdV · Answer 2 · 2022-01-05T12:57:13.427

My two cents:

Formula in B1:

=INDEX(IF(A1:A<>"",SPLIT(REGEXREPLACE(A1:A,"(?:(?:^|[.?!]+)\s*\S+|\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b|.+?)","$1|"),"|",1),""))

The pattern: (?:(?:^|[.?!]+)\s*\S+|\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b|.+?) means:

(?: - Open non-capture group to allow for alternations:
- (?:^|[.?!]+)\s*\S+ - A nested non-capture group to allow for the start-line anchor or 1+ literal dots or question/exclamation marks, followed by 0+ whitespace chars and 1+ non-whitespace chars;
- | - Or;
- \b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b - A 1st capture-group to catch camel-case strings (with optional hyphen) between word-boundaries;
- | - Or;
- .+? - Any 1+ characters (Lazy);
- ) - Close non-capture group.

The idea is here to use REGEXREPLACE() to substitute any match with the backreference to the 1st capture group and a pipe-symbol (or any symbol for that matter that won't be in your input) and use SPLIT() to get all words seperated. Note that it is important to use the 3rd parameter of the function to ignore empty strings.

INDEX() will trigger the array-functionality and spill the results. I used an nested IF() statement to check for empty cells to skip.

Thanks a lot JvdV for the added camel-case and Unicode solution! Very interesting method and insights! I'll study it more in details next asap! Be well! — Lod, Jan 05 '22 at 12:53
@WiktorStribiżew, I don't know where this unicode-aware discussion comes from, but I just wanted to share another (shorter) method to get the answers OP was looking for. — JvdV, Jan 07 '22 at 08:01

pavel · Answer 3 · 2022-07-24T06:37:07.500

1

Since you have already found a Python solution for your use case, you may try utilizing that directly as a custom function in your Google Sheet by hosting the Python code as an API and using Google App script to call it and extract natively in the Google Sheet.

For reference, you can check this repository and this YouTube video.

edited Jul 24 '22 at 06:37

answered Jul 24 '22 at 06:36

pavel

11
2

How To Extract Uppercase First Letter Multiple Words Per Cells Only Ideally Ignoring First Words of Sentences With Google Sheets REGEXEXTRACT Formula?

Other Close Questions and Formulas:

The Regex Pattern I Use:

Other Python Solution Vs Google Sheets Formula:

Question Summary:

Sample Sheet:

3 Answers3