
I'm using the following Python code (which I found online a while ago) to split paragraphs into sentences.

import re

def splitParagraphIntoSentences(paragraph):
  sentenceEnders = re.compile(r"""
      # Split sentences on whitespace between them.
      (?:               # Group for two positive lookbehinds.
        (?<=[.!?])      # Either an end of sentence punct,
      | (?<=[.!?]['"])  # or end of sentence punct and quote.
      )                 # End group of two positive lookbehinds.
      (?<!  Mr\.   )    # Don't end sentence on "Mr."
      (?<!  Mrs\.  )    # Don't end sentence on "Mrs."
      (?<!  Jr\.   )    # Don't end sentence on "Jr."
      (?<!  Dr\.   )    # Don't end sentence on "Dr."
      (?<!  Prof\. )    # Don't end sentence on "Prof."
      (?<!  Sr\.   )    # Don't end sentence on "Sr."
      \s+               # Split on whitespace between sentences.
      """,
      re.IGNORECASE | re.VERBOSE)
  sentenceList = sentenceEnders.split(paragraph)
  return sentenceList

It works just fine for my purpose, but now I need the exact same regex in JavaScript (to make sure that the outputs are consistent), and I'm struggling to translate this Python regex into one that is compatible with JavaScript.

chrisvdb
  • Bear in mind this: http://stackoverflow.com/questions/3569104/positive-look-behind-in-javascript-regular-expression – mplungjan Sep 21 '15 at 13:01
  • Thank you for the pointer. Sounds like it'll be a pain to get the exact same behavior... – chrisvdb Sep 21 '15 at 13:09

1 Answer

This is not a regex for a direct split, but rather a kind of workaround:

(?!Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.)(\b\S+[.?!]["']?)\s

You can replace each matched fragment with, for example, $1# (or, instead of #, any other character that does not occur in the text), and then split the result on #. However, it is not a very elegant solution.
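The replace-then-split idea can be sketched in JavaScript roughly as follows. This is a minimal sketch, not a drop-in equivalent of the Python function: the names `splitSentences` and `marker` are made up for illustration, and the default marker (`\u0000`) is only an assumption about a character that will not appear in the input.

```javascript
// Regex from the answer: a negative lookahead skips the listed titles, and
// group 1 captures a word ending in sentence punctuation (plus an optional quote).
const SENTENCE_END = /(?!Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.)(\b\S+[.?!]["']?)\s/g;

// `marker` is an assumption: any character that never occurs in the input.
function splitSentences(paragraph, marker = "\u0000") {
  if (paragraph.includes(marker)) {
    throw new Error("marker occurs in the input; pick another one");
  }
  // Keep the sentence-ending word (group 1), swap the single whitespace
  // character after it for the marker, then split on the marker.
  return paragraph
    .replace(SENTENCE_END, (_, sentenceEnd) => sentenceEnd + marker)
    .split(marker);
}

console.log(splitSentences("Mr. Smith went home. He slept well! Did Dr. Jones call? Yes."));
// → [ 'Mr. Smith went home.', 'He slept well!', 'Did Dr. Jones call?', 'Yes.' ]
```

Two differences from the Python version are worth keeping in mind when comparing outputs: this regex consumes only one whitespace character (the Python one splits on `\s+`, so after a double space the next sentence here would start with a space), and it is case-sensitive, whereas the Python pattern is compiled with `re.IGNORECASE`.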

m.cekiera
  • One quick follow up question: I wanted to add "St." to the list (Mr(s)., Dr., ...) because some locations like "St. Kilda" were handled incorrectly. Unfortunately, unlike with Mr(s)/Dr/... quite a few words end in "st" so adding "|St\." doesn't work. My solution of putting a space in front of "St." also does not work. Would you know a solution for that? – chrisvdb Sep 21 '15 at 15:26
  • @chrisvdb Are you sure it doesn't work? I cannot reproduce your problem [DEMO](https://regex101.com/r/nH6eH3/2). Could you post some examples? In any case, you could use something like: `(?!\b(?:St\.|Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.))(\b\S+[.?!]["']?)\s` with `\b` (a word boundary) before the alternatives. This means that, for example, `St.` cannot be preceded by any word character (`0-9a-zA-Z_`) – m.cekiera Sep 21 '15 at 15:44
  • @chrisvdb Actually, the `\b\S+` part should already prevent the problem you described: it matches only whole words, and the negative lookahead applies to that same fragment – m.cekiera Sep 21 '15 at 15:48
  • Great, that works! Thanks again! Also, I'm using this trick to do the splitting: https://jsfiddle.net/onssubo7/1/. I then use these indices to get the correct parts of the string. – chrisvdb Sep 21 '15 at 15:49
  • Here's a slight improvement, so that the expression doesn't pick up any title and doesn't pick up acronyms either. It only fails to pick the end of a sentence if the last word starts with caps, which would be true if that last word was a proper name. `(?![A-Z]..?\.)(?![A-Z]?\.)(\b\S+[.?!]["']?)\s` – Javier Cordero Dec 25 '18 at 22:59