
I'm using the regex from the accepted answer to this question to split text into sentences, but it doesn't work in Safari because Safari does not support lookbehind assertions (yet).

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

The regex splits the following string:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.

Into:

[
"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it."
"Did he mind?"
"Adam Jones Jr. thinks he didn't."
"In any case, this isn't true..."
"Well, with a probability of .9 it isn't."
]

Basically, it extracts the sentences from a string.

Any ideas on how to make it compatible with Safari?

2 Answers


This regex can be re-written for use in match/matchAll or exec:

/((?:[A-Z][a-z]\.|\w\.\w[\w\W]|[\w\W])*?[.!?])(?:\s+|$)/g

See the regex demo. Details:

  • ((?:[A-Z][a-z]\.|\w\.\w[\w\W]|[\w\W])*?[.!?]) - Group 1:
    • (?:[A-Z][a-z]\.|\w\.\w[\w\W]|[\w\W])*? - zero or more occurrences of
      • [A-Z][a-z]\.| - an uppercase letter, a lowercase letter, and then a ., or
      • \w\.\w[\w\W]| - a word char, a ., a word char, and then any one char, or
      • [\w\W] - any single char
    • [.!?] - a ., ! or ?
  • (?:\s+|$) - one or more whitespaces or end of string.
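
To see why the first two alternatives matter, here is a small sketch (not from the original answer) that compares the full pattern with a hypothetical stripped-down variant, `naive`, that keeps only the `[\w\W]` branch. The sample string and the `collect` helper are made up for illustration. Because the `[A-Z][a-z]\.` branch is tried first and consumes `Mr.` as a unit, the lazy quantifier never gets the chance to stop at that period.

```javascript
// Illustrative input, not taken from the question.
const sample = "Mr. Smith left. He returned.";

// Pattern from the answer: abbreviation-like sequences are consumed whole.
const full = /((?:[A-Z][a-z]\.|\w\.\w[\w\W]|[\w\W])*?[.!?])(?:\s+|$)/g;

// Hypothetical variant with only the "any single char" branch, for comparison.
const naive = /([\w\W]*?[.!?])(?:\s+|$)/g;

// Collect Group 1 (the sentence without trailing whitespace) from every match.
const collect = (text, rx) => Array.from(text.matchAll(rx), (m) => m[1]);

const withBranches = collect(sample, full);
const withoutBranches = collect(sample, naive);

console.log(withBranches);    // [ 'Mr. Smith left.', 'He returned.' ]
console.log(withoutBranches); // [ 'Mr.', 'Smith left.', 'He returned.' ]
```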

See the JavaScript demo:

var s = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."
var rx = /((?:[A-Z][a-z]\.|\w\.\w[\w\W]|[\w\W])*?[.!?])(?:\s+|$)/g
var results = [], m;
while(m=rx.exec(s)) {
    results.push(m[1]);
}
console.log(results)
Wiktor Stribiżew
  • I have only one problem with this approach. Say the last sentence, "Well, with a probability of .9 it isn't.", does not have a "." at the end. That string won't be added to the array of sentences. How can I add it even if it does not end like the other sentences? – curiousDev May 21 '21 at 12:56
  • @curiousDev Then use `/((?:[A-Z][a-z]\.|\w\.\w.|.)*?(?:[.!?]|$))(?:\s+|$)/g`, see [demo](https://regex101.com/r/FIP1an/2). – Wiktor Stribiżew May 21 '21 at 13:40
  • Awesome! One problem though: the JS script crashes when I run it with the given regex, and it consumes too much memory. Do I need to modify the JS script? – curiousDev May 21 '21 at 14:09
  • @curiousDev Sorry, in JS, there is little room for enhancing such patterns. You might want to re-formulate the requirements, then we can think of an alternative pattern. – Wiktor Stribiżew May 21 '21 at 14:14
  • Hmmm.... I don't really know what other way to go about it though. Can negative lookbehinds and negative lookaheads be expressed differently? That's the main reason the original regex does not work in Safari. – curiousDev May 21 '21 at 14:26
  • This is what I suggested: I converted the lookbehinds into a consuming pattern. – Wiktor Stribiżew May 21 '21 at 14:27
  • Sorry, it's just the lookbehind assertions that do not work. – curiousDev May 21 '21 at 14:29
  • @curiousDev Hello, this doesn't work for the last sentence in this paragraph, how could this be implemented?: "It was not a good translation because, according to Dr. Giles, "[I]t contains a great deal that Sun Tzŭ did not write, and very little indeed of what he did." The first translation into English was published in 1905 in Tokyo by Capt. E. F. Calthrop, R.F.A. However, this translation is" – Crunch Feb 10 '23 at 22:58
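
The runaway memory use reported in the comments above is consistent with how `exec` treats zero-length matches: the `$` alternatives let the revised pattern match an empty string at the very end of the input, `exec` then leaves `lastIndex` where it was, and the `while` loop keeps pushing empty strings forever. A minimal sketch of the usual guard, assuming the question's sample text with the final period removed (this is not code from the thread):

```javascript
// Sample text from the question, with the final period deliberately left off.
const s = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't";

// Revised pattern from the comment: a sentence may also end at end of input.
const rx = /((?:[A-Z][a-z]\.|\w\.\w.|.)*?(?:[.!?]|$))(?:\s+|$)/g;

const results = [];
let m;
while ((m = rx.exec(s)) !== null) {
  if (m[0] === "") {
    // Zero-length match: exec does not advance lastIndex on its own,
    // so step forward manually to avoid looping forever.
    rx.lastIndex++;
    continue;
  }
  results.push(m[1]);
}

console.log(results.length); // 5
console.log(results[4]);     // "Well, with a probability of .9 it isn't"
```

`String.prototype.match` and `matchAll` advance past empty matches on their own, which is why only the hand-written `exec` loop needs this check.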

Okay, so based on the regex that Wiktor Stribiżew gave (big thanks), I can do something like this:

const string = "My perfect string"
const regex = /((?:[A-Z][a-z]\.|\w\.\w.|.)*?(?:[.!?]|$))(?:\s+|$)/g
const sentences = string.match(regex)

This gives me the sentences, but note that the last element of the array is an empty string, so it needs to be removed.

[
  'Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
  'Did he mind? ',
  "Adam Jones Jr. thinks he didn't. ",
  "In any case, this isn't true... ",
  "Well, with a probability of .9 it isn't.",
  ''
]

Removing the last element from the array then gives the intended result:

sentences.pop()

If you also need to remove the trailing spaces from the sentences, you can trim them:

// `sentences` is already declared above, so store the trimmed copy under a new name
const trimmed = sentences.map(sentence => sentence.trim())
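
Putting the steps above together, one possible helper (a sketch; the name `splitSentences` is made up here) trims each match and filters out empty strings instead of calling `pop()`, so it does not rely on the empty match always being the last element:

```javascript
// Hypothetical helper name; the pattern is the one used in this answer.
function splitSentences(text) {
  const regex = /((?:[A-Z][a-z]\.|\w\.\w.|.)*?(?:[.!?]|$))(?:\s+|$)/g;
  // match() returns null when nothing matches, so fall back to an empty array.
  const matches = text.match(regex) || [];
  return matches
    .map((sentence) => sentence.trim()) // drop the trailing whitespace
    .filter((sentence) => sentence.length > 0); // drop the empty end-of-input match
}

const text = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.";
const sentences = splitSentences(text);
console.log(sentences);
```

Since the pattern contains no lookbehind, this should also parse in browsers that reject the original regex.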