
I have the following code, which takes in a string and returns the individual words of the string minus any punctuation:

import re

def word_split(quote):
    # \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
    return re.findall(r'\w+', quote.lower())

output: ['to', 'me', 'there', 'has', 'never', 'been', 'a', 'higher', 'source', 'of', 'earthly', 'honor', 'or', 'distinction', 'than', 'that', 'connected', 'with', 'advances', 'in', 'science', 'isaac', 'newton']

However, some quotes include author names with initials, such as J.K. Rowling, and the code splits her name into separate tokens at the J and the K. Is there a way to rewrite this code so that it doesn't split those abbreviated names?
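To make the goal concrete, here is a minimal sketch of a regex-only workaround. It assumes abbreviated names always appear as two or more single ASCII capital letters, each followed by a period (J.K., J.R.R.). The function name `word_split_keep_initials` and the pattern `TOKEN_RE` are hypothetical, chosen for illustration. A lone initial such as "J. Rowling" is still split, because it looks the same as a sentence ending in a capital letter.

```python
import re

# Hypothetical pattern: try runs of initials like "J.K." first,
# then fall back to ordinary words.
# Assumption: initials are single ASCII capitals, each followed by a period.
TOKEN_RE = re.compile(r'(?:[A-Z]\.){2,}|\w+')

def word_split_keep_initials(quote):
    # Match on the original text so that only capitalized initials qualify,
    # then lowercase each token to mirror word_split().
    return [token.lower() for token in TOKEN_RE.findall(quote)]
```

For example, `word_split_keep_initials("Harry Potter, by J.K. Rowling.")` keeps `'j.k.'` as one token. Any pattern like this will also catch other capitalized abbreviations (for instance "U.S.A."), so it is a heuristic, not a general solution.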

  • You need to use a natural language processing library; regular expressions can't do this reliably. – Barmar Jul 21 '20 at 00:10
  • I think using AI is your best bet. I can only imagine how many false positives you'd get even if you came up with some sophisticated non-AI system. – Bren Jul 21 '20 at 00:15

0 Answers