Splitting paragraph into sentences

Question

I'm using the following Python code (which I found online a while ago) to split paragraphs into sentences.

def splitParagraphIntoSentences(paragraph):
  import re
  sentenceEnders = re.compile(r"""
      # Split sentences on whitespace between them.
      (?:               # Group for two positive lookbehinds.
        (?<=[.!?])      # Either an end of sentence punct,
      | (?<=[.!?]['"])  # or end of sentence punct and quote.
      )                 # End group of two positive lookbehinds.
      (?<!  Mr\.   )    # Don't end sentence on "Mr."
      (?<!  Mrs\.  )    # Don't end sentence on "Mrs."
      (?<!  Jr\.   )    # Don't end sentence on "Jr."
      (?<!  Dr\.   )    # Don't end sentence on "Dr."
      (?<!  Prof\. )    # Don't end sentence on "Prof."
      (?<!  Sr\.   )    # Don't end sentence on "Sr."
      \s+               # Split on whitespace between sentences.
    """, 
    re.IGNORECASE | re.VERBOSE)
  sentenceList = sentenceEnders.split(paragraph)
  return sentenceList

It works just fine for my purpose, but now I need the exact same regex in JavaScript (to make sure that the outputs are consistent), and I'm struggling to translate this Python regex into one compatible with JavaScript.

bear in mind this: http://stackoverflow.com/questions/3569104/positive-look-behind-in-javascript-regular-expression — mplungjan, Sep 21 '15 at 13:01
Thank you for the pointer. Sounds like it'll be a pain to get the exact same behavior... — chrisvdb, Sep 21 '15 at 13:09
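
For readers on a newer engine: lookbehind assertions were added to JavaScript in ES2018, well after this question was asked, so the Python pattern can be carried over almost unchanged where they are supported. The sketch below assumes such an engine. JavaScript has no verbose flag, so the pattern is assembled from commented string pieces instead. The constant name `SENTENCE_BOUNDARY` is made up for illustration.

```javascript
// Assumption: the runtime supports ES2018 lookbehind assertions.
// Each array entry mirrors one line of the verbose Python pattern.
const SENTENCE_BOUNDARY = new RegExp(
  [
    "(?:(?<=[.!?])|(?<=[.!?]['\"]))", // end punctuation, optionally followed by a quote
    "(?<!Mr\\.)",                     // don't end a sentence on "Mr."
    "(?<!Mrs\\.)",                    // ... or "Mrs."
    "(?<!Jr\\.)",                     // ... or "Jr."
    "(?<!Dr\\.)",                     // ... or "Dr."
    "(?<!Prof\\.)",                   // ... or "Prof."
    "(?<!Sr\\.)",                     // ... or "Sr."
    "\\s+",                           // split on the whitespace between sentences
  ].join(""),
  "i" // counterpart of re.IGNORECASE
);

function splitParagraphIntoSentences(paragraph) {
  // The pattern has no capturing groups, so split() returns only the sentences.
  return paragraph.split(SENTENCE_BOUNDARY);
}
```

On engines without lookbehind the `RegExp` constructor throws a `SyntaxError`, in which case a lookahead-based workaround is needed instead.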

score 2 · Accepted Answer · answered Sep 21 '15 at 13:48


This is not a regex you can split on directly, but a kind of workaround:

(?!Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.)(\b\S+[.?!]["']?)\s

DEMO

You can replace each matched fragment with, for example, $1# (or, instead of #, any other character that does not occur in the text), and then split the result on # (DEMO). It is not a very elegant solution, though.
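
The replace-then-split idea can be sketched in JavaScript roughly as follows. The function name `splitWithPlaceholder` and the default placeholder (a NUL character rather than `#`) are assumptions made for this illustration. The regex itself is the one given above, with the global flag added so that every boundary is replaced.

```javascript
// Regex from the answer, plus the "g" flag so replace() handles every match.
const SENTENCE_END = /(?!Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.)(\b\S+[.?!]["']?)\s/g;

// Hypothetical helper: marks each sentence boundary with a placeholder,
// then splits on that placeholder.
function splitWithPlaceholder(text, placeholder = "\u0000") {
  if (text.includes(placeholder)) {
    throw new Error("placeholder already occurs in the text");
  }
  // Same effect as replacing with "$1#": keep the captured last word and
  // swap the single whitespace character after it for the placeholder.
  const marked = text.replace(
    SENTENCE_END,
    (_match, lastWord) => lastWord + placeholder
  );
  // The regex consumes only one whitespace character, so trim any leftovers.
  return marked
    .split(placeholder)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}
```

For example, `splitWithPlaceholder('Mr. Smith went home. He said "Stop!" Then he left.')` yields three sentences, with "Mr." kept attached to "Smith".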

m.cekiera


One quick follow-up question: I wanted to add "St." to the list (Mr(s)., Dr., ...) because some locations like "St. Kilda" were handled incorrectly. Unfortunately, unlike with Mr(s)/Dr/..., quite a few words end in "st", so adding "|St\." doesn't work. My solution of putting a space in front of "St." also does not work. Would you know a solution for that? – chrisvdb Sep 21 '15 at 15:26
@chrisvdb Are you sure it doesn't work? I cannot reproduce your problem [DEMO](https://regex101.com/r/nH6eH3/2). Could you post some examples? In any case, you could use something like: `(?!\b(?:St\.|Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.))(\b\S+[.?!]["']?)\s` with `\b` (a word boundary) before the alternatives. It means that, for example, `St.` cannot be preceded by any word character `(0-9a-zA-Z_)` – m.cekiera Sep 21 '15 at 15:44
@chrisvdb Actually, the `\b\S+` part should already prevent the problem you described: it matches only whole words, and the negative lookahead applies to the same fragment – m.cekiera Sep 21 '15 at 15:48
Great, that works! Thanks again! Also, I'm using this trick to do the splitting: https://jsfiddle.net/onssubo7/1/. I then use these indices to get the correct parts of the string. – chrisvdb Sep 21 '15 at 15:49
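
The `\b` variant from the comments can also be used without a placeholder character, by recording where each match ends and slicing the original string. The following is only a sketch of that idea: it is not necessarily what the linked jsfiddle does, and the name `splitByMatchIndices` is made up for illustration.

```javascript
// Hypothetical helper: slices the text at the match positions instead of
// inserting a placeholder character.
function splitByMatchIndices(text) {
  // Regex from the comment above, with "St." added and the "g" flag,
  // which matchAll() requires.
  const boundary = /(?!\b(?:St\.|Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.))(\b\S+[.?!]["']?)\s/g;
  const sentences = [];
  let start = 0;
  for (const match of text.matchAll(boundary)) {
    // match[1] is the sentence-final word without the trailing whitespace.
    const end = match.index + match[1].length;
    sentences.push(text.slice(start, end));
    // Continue after the single whitespace character the regex consumed.
    start = match.index + match[0].length;
  }
  // Whatever follows the last boundary is the final sentence.
  if (start < text.length) {
    sentences.push(text.slice(start));
  }
  return sentences;
}
```

With the input `"I came first. St. Kilda is nice. Really!"`, the word "first." still ends a sentence, while "St." stays attached to "Kilda".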

Here's a slight improvement, so that the expression doesn't pick up any title and doesn't pick up acronyms either. It only fails to pick the end of a sentence if the last word starts with caps, which would be true if that last word was a proper name. `(?![A-Z]..?\.)(?![A-Z]?\.)(\b\S+[.?!]["']?)\s` – Javier Cordero Dec 25 '18 at 22:59
