Sentence splitting based on a regular expression

Question

I am trying to split an article into sentences, using the following code (written by someone who has since left the organization). Please help me understand the code.

re.split(r' *[.,:-\@/_&?!;][\s ]+', x)

Which part don't you understand? The split, or the regex pattern? — OneCricketeer, Jun 22 '17 at 11:36

score 1 · Answer 1 · answered Jun 22 '17 at 13:28

It looks for punctuation marks such as stops, commas and colons, optionally preceded by spaces and always followed by at least one whitespace character. In the commonest case that will be ". ". Then it splits the string x into pieces by removing the matched punctuation and returning whatever is left as a list.

>>> import re
>>> x = "First sentence. Second sentence? Third sentence."
>>> re.split(r' *[.,:-\@/_&?!;][\s ]+', x)
['First sentence', 'Second sentence', 'Third sentence.']

The regular expression is unnecessarily complex and doesn't do a very good job.

This bit: :-\@ has a redundant quoting backslash, and means the characters between ASCII 58 and 64 inclusive, in other words : ; < = > ? @. It would be better to list the 7 characters explicitly, because most people will not know what characters fall in that range. That includes me: I had to look it up. And it clearly also includes the code's author, since he redundantly specified ? and ; again at the end.
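To see for yourself what the range covers, here is a small sketch that expands it by code point. The variable name range_chars is just for illustration.

```python
# Expand the character-class range ':-@' by walking the code points
# from ord(':') == 58 up to and including ord('@') == 64.
range_chars = [chr(code) for code in range(ord(':'), ord('@') + 1)]
print(range_chars)
```

That prints the seven characters listed above, which is why the later ? and ; in the class add nothing.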

This bit [\s ]+ means one or more spaces or whitespace characters but a space is a whitespace character so that could be more simply expressed as \s+.
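As a rough check, the pattern can be rewritten with the range spelled out, the duplicates dropped and \s+ in place of [\s ]+. The names original and simplified are mine, and this only shows the two agree on the sample strings, not that the rewritten pattern is a good sentence splitter.

```python
import re

original = r' *[.,:-\@/_&?!;][\s ]+'
# Same character set listed explicitly, without the duplicated ? and ;
simplified = r' *[.,:;<=>?@/_&!]\s+'

samples = [
    "First sentence. Second sentence? Third sentence.",
    "a@b.com , see docs/readme ; done!\nNext line: ok",
]
for text in samples:
    # Both patterns should split every sample identically.
    assert re.split(original, text) == re.split(simplified, text)
```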

Note the retained full stop in the 3rd element of the returned list. That is because the last full stop comes at the very end of the string, where nothing follows it, and the regular expression insists that at least one whitespace character must. Retaining the full stop is okay, but only if it is done consistently for all sentences, not just for the one that ends the text.
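If you do want the terminating punctuation kept on every sentence, one possible approach (a minimal sketch, not what the original code does) is to split on the whitespace that follows a sentence-ending mark, using a lookbehind so the mark itself is not consumed. I am assuming here that only . ? and ! end a sentence. It will still split wrongly after abbreviations such as "Dr. Smith".

```python
import re

x = "First sentence. Second sentence? Third sentence."

# (?<=[.?!]) asserts that the whitespace is preceded by . ? or !
# without including that character in the match, so it is retained.
sentences = re.split(r'(?<=[.?!])\s+', x.strip())
print(sentences)
```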

Throw away that bit of code and start from scratch. Or use nltk, which has power tools for splitting text into sentences and is likely to do a much more respectable job.

>>> import nltk
>>> sent_tokenizer=nltk.punkt.PunktSentenceTokenizer()
>>> sent_tokenizer.sentences_from_text(x)
['First sentence.', 'Second sentence?', 'Third sentence.']
