I'm using this regex, which was accepted as an answer in this question, to split a paragraph into sentences. However, it splits at abbreviations and initials that contain or end with a period, and it mishandles sentences that end with a closing quotation mark. I can't use negative lookbehinds because Safari does not support them.

((?:[A-Z][a-z]\.|\w\.\w.|.)*?(?:[.!?]|$))(?:\s+|$)

The regex splits the following string:

"It was not a good translation because, according to Dr. Giles, "[I]t contains a great deal that Sun Tzŭ did not write, and very little indeed of what he did." The first translation into English was published in 1905 in Tokyo by Capt. E. F. Calthrop, R.F.A. However, this translation is"

into:

["It was not a good translation because, according to Dr. Giles, "[I]t contains a great deal that Sun Tzŭ did not write, and very little indeed of what he did." The first translation into English was published in 1905 in Tokyo by Capt.", 
"E.","F.", "Calthrop, R.F.A.", "However, this translation is"]

Intended output is:

1: "It was not a good translation because, according to Dr. Giles, "[I]t contains a great deal that Sun Tzŭ did not write, and very little indeed of what he did.""

2: "The first translation into English was published in 1905 in Tokyo by Capt. E. F. Calthrop, R.F.A."

3: "However, this translation is"

I am stumped on how to fix this. The regex works for most sentences, but this paragraph seems to be an edge case where the punctuation throws it off.

Any ideas on how I can fix this while also keeping it compatible with Safari?

Crunch
  • What is the expected output? – trincot Feb 08 '23 at 18:09
  • @trincot 1: "It was not a good translation because, according to Dr. Giles, "[I]t contains a great deal that Sun Tzŭ did not write, and very little indeed of what he did."" 2: "The first translation into English was published in 1905 in Tokyo by Capt. E. F. Calthrop, R.F.A." 3: "However, this translation is" – Crunch Feb 08 '23 at 19:04
  • By which logic do you want to split "R.F.A. However," but keep "E. F. Calthrop," together? – trincot Feb 08 '23 at 19:10
  • @trincot I would like to keep "Capt. E. F. Calthrop, R.F.A." all together. Another issue is that the sentence "It was not a good translation because, according to Dr. Giles, "[I]t contains a great deal that Sun Tzŭ did not write, and very little indeed of what he did."" is not splitting properly. The punctuation inside of the quotes ends the sentence. – Crunch Feb 08 '23 at 19:14
  • I understand that, but you didn't answer my question. By which logic do you determine that there should be a split before "However," but not before "Calthrop,"? – trincot Feb 08 '23 at 19:24
  • @trincot I see what you're saying, so it's not logically possible to split this paragraph with a Safari-compatible regex? – Crunch Feb 08 '23 at 20:06
  • Sure it is possible to split it, but before you code something you first need to define what you actually want to implement. You must be able to describe the rules. What are the rules? Forget regex and Safari for a moment and describe in English what the rules are for splitting. – trincot Feb 08 '23 at 20:07
  • @trincot The rules that I'm trying to implement are 1: If there is a punctuation mark (.!?) between 2 quotation marks, followed by a capital letter, then split. 2: Zero or more occurrences of an uppercase letter, period, space, uppercase letter, and then any whitespace followed by a word char, followed by a space or any other single char – Crunch Feb 08 '23 at 22:29
  • You can't split sentences unless you know what a sentence is. The closest you can get is to split on words `[^\pL\pN]*[\pL\pN](?:[\pL\pN_-]|\pP(?=[\pL\pN\pP_-]))*` or `[\W_]*[^\W_](?:\w|[[:punct:]_-](?=[\w[:punct:]-]))*` – sln Feb 09 '23 at 00:55
  • @sln this solution didn't work – Crunch Feb 09 '23 at 02:22
  • @sln I will make the problem statement a bit more informative in the am! – Crunch Feb 09 '23 at 02:22
  • @Crunch The rules you mentioned in the comment above do not meet the criteria of the output you are expecting. Can you please make clear what you actually want to implement? – Debug Diva Feb 09 '23 at 05:18
  • The rules you describe don't explain why you want to split in the middle of "R.F.A. However," but keep "E. F. Calthrop," together in the same string. I'm voting to close this question as not clear what you want. – trincot Feb 09 '23 at 07:16
  • @trincot I'm not looking for a lecture, I'm looking for an answer to my question. I am not a regex expert and have only used it for much smaller tasks. I've already explained exactly what the purpose of this regex is, which is to split paragraphs into sentences, and I don't know what more you want me to explain to you. – Crunch Feb 09 '23 at 20:32
  • I am not giving a lecture? Why do you say that? Just trying to understand what the rule is, and I want to understand it. Sorry I offended you. I will just silently move on. – trincot Feb 09 '23 at 20:36
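
The rules asked for in the comments can be made explicit without lookbehind by letting a regex find only candidate boundaries and deciding in plain code whether each one really ends a sentence. The following is a minimal sketch under stated assumptions, not a general sentence splitter: `splitSentences` and `ABBREVIATIONS` are hypothetical names, the abbreviation list is illustrative, and the rule "a period after a title abbreviation or a single capital initial never ends a sentence" is one possible answer to the "R.F.A. However," versus "E. F. Calthrop," question rather than something stated in the post. It uses only a lookahead, which Safari supports.

```javascript
// Hypothetical, illustrative list: abbreviations whose period never ends a sentence.
const ABBREVIATIONS = new Set(["Dr", "Mr", "Mrs", "Ms", "Prof", "Capt", "Col", "Gen", "St"]);

function splitSentences(text) {
  const sentences = [];
  // Candidate boundary: terminal punctuation, optional closing quotes/brackets,
  // whitespace, then (lookahead only) an optional opening quote/bracket and an
  // ASCII uppercase letter. No lookbehind is used.
  const boundary = /[.!?]["'”’)\]]*\s+(?=["'“‘(\[]?[A-Z])/g;
  let start = 0;
  let match;
  while ((match = boundary.exec(text)) !== null) {
    const punctuationIndex = match.index;
    // The whitespace-delimited token immediately before the punctuation mark.
    const lastWord = text.slice(start, punctuationIndex).split(/\s+/).pop();
    const isPeriod = text[punctuationIndex] === ".";
    const isInitial = /^[A-Z]$/.test(lastWord);        // e.g. "E" in "E. F. Calthrop"
    const isAbbreviation = ABBREVIATIONS.has(lastWord); // e.g. "Dr", "Capt"
    // Assumed rule: skip periods after a known abbreviation or a single initial.
    if (isPeriod && (isInitial || isAbbreviation)) continue;
    const end = match.index + match[0].length;
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  // Whatever follows the last accepted boundary is the final (possibly partial) sentence.
  const rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}
```

Under these assumptions, calling `splitSentences` on the paragraph from the question should return the three intended strings: "R.F.A." is treated as a sentence end because its last token is neither a single initial nor in the list, while "Dr.", "Capt.", "E." and "F." are skipped. Any abbreviation missing from the list, or a sentence that really does end in an initial, will still be split incorrectly.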

0 Answers