How do I isolate words in a string, including abbreviated names?

Asked Jul 21 '20 at 00:06

Active Jul 21 '20 at 00:10

I have the following code, which takes in a string and returns the individual words of the string minus any punctuation:

import re

def word_split(quote):
    return re.findall(r'\w+', quote.lower())

output: ['to', 'me', 'there', 'has', 'never', 'been', 'a', 'higher', 'source', 'of', 'earthly', 'honor', 'or', 'distinction', 'than', 'that', 'connected', 'with', 'advances', 'in', 'science', 'isaac', 'newton']

However, some quotes are attributed to authors whose names contain initials, such as J.K. Rowling, and this code splits the name into 'j', 'k', and 'rowling'. Is there a way to rewrite the code so that it doesn't split abbreviated names?

edited Jul 21 '20 at 00:10

Barmar

asked Jul 21 '20 at 00:06

CuriousStatistician

You need to use a natural language processing library; regular expressions can't do this reliably. – Barmar Jul 21 '20 at 00:10
I think using AI is your best bet. I can only imagine how many false positives you'd get even if you came up with some sophisticated non-AI system. – Bren Jul 21 '20 at 00:15
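
As the comments note, a regular expression won't handle every case, but for the narrow case in the question (initials written without spaces, like J.K.), a minimal sketch is to try an "initials" alternative before falling back to `\w+`. The name `word_split_keep_initials` and the rule "two or more single letters, each followed by a period" are assumptions made for illustration, not a general solution. It will not join spaced initials such as "J. K. Rowling", and it also keeps abbreviations like "e.g." as one token.

```python
import re

# First alternative: two or more single letters, each followed by a period
# (e.g. "j.k."). Second alternative: an ordinary run of word characters.
# Requiring at least two initials avoids treating a sentence-ending "a." as one.
TOKEN_RE = re.compile(r"\b(?:[a-z]\.){2,}|\w+")

def word_split_keep_initials(quote):
    """Split a quote into lowercase words, keeping unspaced initials together."""
    return TOKEN_RE.findall(quote.lower())

print(word_split_keep_initials("J.K. Rowling"))
# ['j.k.', 'rowling']
```

Because the initials alternative comes first, the regex engine prefers it wherever it matches; everywhere else the behavior is the same as the original `\w+` pattern.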
