JavaScript RegExp for splitting text into sentences and keeping the delimiter

Question

I am trying to use JavaScript's split to get the sentences out of a string while keeping the delimiters (e.g. !?.).

So far I have

sentences = text.split(/[\\.!?]/);

which works but does not include the ending punctuation for each sentence (.!?).

Does anyone know of a way to do this?

`?` is also a special char in RegExp so you need to escape it — rgvcorley, Aug 01 '12 at 14:37
Metacharacters like `.` and `?` lose their special meanings inside a character class. The correct way to match a dot (`.`), an exclamation point (`!`), or a question mark (`?`) is `[.!?]`. — Alan Moore, May 12 '13 at 07:14
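
To illustrate the comment above, here is a minimal sketch (the sample strings are made up for this example). Inside a character class, `.` and `?` are literal, and the doubled backslash in the question's pattern only adds a literal backslash to the set. It also shows the original problem: split consumes the delimiters.

```javascript
// Sample input invented for illustration.
const text = "Hi there. How are you? Fine!";

// Inside [...] the characters . ! ? are literal, so no escaping is needed.
const plain = text.split(/[.!?]/);
console.log(plain); // the delimiters are consumed by split

// [\\.!?] is the same class plus a literal backslash.
const withBackslash = "a\\b".split(/[\\.!?]/);
console.log(withBackslash);
```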

score 71 · Accepted Answer · answered Aug 01 '12 at 14:43

You need to use match, not split.

Try this:

var str = "I like turtles. Do you? Awesome! hahaha. lol!!! What's going on????";
// One or more non-terminators followed by one or more terminators.
var result = str.match( /[^\.!\?]+[\.!\?]+/g );

// Each match keeps the leading space from the original text.
var expect = ["I like turtles.", " Do you?", " Awesome!", " hahaha.", " lol!!!", " What's going on????"];
console.log( result.join(" ") === expect.join(" ") ); // true
console.log( result.length === 6 ); // true
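
One practical detail when using this: match returns null (not an empty array) when nothing matches, and each match keeps its leading whitespace. A minimal sketch of a wrapper that handles both; `splitSentences` is a hypothetical name, not part of the answer:

```javascript
// Hypothetical helper wrapping the match-based approach from the answer.
function splitSentences(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const parts = text.match(/[^.!?]+[.!?]+/g) || [];
  // Each match keeps its leading whitespace; trim it for cleaner output.
  return parts.map((s) => s.trim());
}

console.log(splitSentences("I like turtles. Do you? Awesome!"));
console.log(splitSentences("no terminator here")); // nothing matches
```

Note that text after the last terminator is still dropped by this pattern.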

Larry Battle

You can use a split: `text.split(/\b(?![\?\.\!])/);` \b tells it to split on word boundaries, the nifty part is the negative look-ahead. – bavo Dec 06 '15 at 23:35
The regex is wrong. If I type: "Phrase 1. Phrase 2. Phrase 3", "Phrase 3" gets thrown away. – Patricio Córdova Feb 05 '17 at 00:57
Here's a variation that also works when the last sentence ends without punctuation: `var result = str.match(/([^\.!\?]+[\.!\?]+)|([^\.!\?]+$)/g);` – Aneon Aug 24 '17 at 11:32
Wow, this stuff also catches ellipsis too. `var str = "I like turtles... Do you? Awesome! hahaha. lol!!! What's going on????"; ` – giorgio79 Nov 09 '17 at 11:38
this breaks when having floating point numbers: `Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum"` – androidu Oct 06 '18 at 12:09
It picks abbreviations such as P.S. and makes it P. S. Any clue how to manage abbreviations? – Vikas Roy Jan 22 '21 at 13:46
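
One possible way to cope with the decimal-number case raised in the comments is to treat `.`, `!` or `?` as a sentence end only when it is followed by whitespace or the end of the input. This is a sketch under that assumption, not something from the answer; `splitOnSpacedPunctuation` is a hypothetical name, and the `s` (dotAll) flag needs an ES2018 engine.

```javascript
// Hypothetical helper: punctuation ends a sentence only when followed by
// whitespace or the end of the input.
function splitOnSpacedPunctuation(text) {
  // \S               start at the first non-space character
  // .*?              lazily consume the sentence body (s flag: dot matches newlines)
  // [.!?]+(?=\s|$)   terminators followed by whitespace or end of input
  // |$               fallback for a last sentence without punctuation
  return text.match(/\S.*?(?:[.!?]+(?=\s|$)|$)/gs) || [];
}

console.log(splitOnSpacedPunctuation("See sections 1.10.32 and 1.10.33 of it. Next one!"));
console.log(splitOnSpacedPunctuation("Phrase 1. Phrase 2. Phrase 3"));
```

Abbreviations are only partly helped: "P.S." is no longer broken into "P." and "S.", but it is still treated as a sentence of its own when followed by a space. Handling that reliably probably needs a list of known abbreviations.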

score 12 · Answer 2 · answered Jan 10 '14 at 00:30

The following is a small addition to Larry's answer which also matches parenthetical sentences:

text.match(/\(?[^\.\?\!]+[\.!\?]\)?/g);

applied on:

text = "If he's restin', I'll wake him up! (Shouts at the cage.) " +
  "'Ello, Mister Polly Parrot! (Owner hits the cage.) There, he moved!!!";

giveth:

["If he's restin', I'll wake him up!", " (Shouts at the cage.)", 
" 'Ello, Mister Polly Parrot!", " (Owner hits the cage.)", " There, he moved!!!"]

mircealungu

You missed the `+` after the punctuation character class `[.!?]`, so it will not capture the three exclamations after "he moved". – Mogsdad Sep 28 '15 at 23:56
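
A quick way to see the difference the comment describes (a minimal sketch with a shortened sample string made up for this example):

```javascript
// Shortened sample invented for illustration.
const text = "There, he moved!!! (Owner hits the cage.)";

// Pattern as posted in the answer: exactly one terminator per sentence.
const single = text.match(/\(?[^.?!]+[.!?]\)?/g);

// With + added after the punctuation class: one or more terminators.
const repeated = text.match(/\(?[^.?!]+[.!?]+\)?/g);

console.log(single);   // first item ends after a single "!"
console.log(repeated); // first item keeps all three "!"
```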

James · Answer 3 · 2022-05-17T22:09:38.463

Improving on lonemc's answer (which improved on Mia Chen's answer, which improved on mircealungu's answer):

First, we can add the u flag to the regex, which enables Unicode mode and the \p{...} property escapes used below. In other words, we probably want to be able to parse German sentences, French sentences, etc.

Second, instead of hard-coding the characters that should end a sentence, we can use "Sentence_Terminal", which is a property defined by the Unicode standard.

Third, instead of hard-coding the characters that make up a closing bracket, we can use "Close_Punctuation".

Fourth, instead of hard-coding the characters that make up a closing quote, we can use "Final_Punctuation".

Fifth, we might not want to match things that look like enums. For example:

This is the first sentence! This is the second sentence with MyEnum.Value1 where I talk about it!

In order to do that, we can compose a match using a lookahead pattern:

string.match(/(?=[^])(?:\P{Sentence_Terminal}|\p{Sentence_Terminal}(?!['"`\p{Close_Punctuation}\p{Final_Punctuation}\s]))*(?:\p{Sentence_Terminal}+['"`\p{Close_Punctuation}\p{Final_Punctuation}]*|$)/guy);
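
For reference, a small sketch of how the pattern above behaves on the enum example. The regex is copied unchanged from the answer; the variable names and the trimming step are additions for illustration. It assumes an engine with Unicode property escapes (ES2018 or later).

```javascript
// The pattern from the answer, unchanged.
const sentenceRe = /(?=[^])(?:\P{Sentence_Terminal}|\p{Sentence_Terminal}(?!['"`\p{Close_Punctuation}\p{Final_Punctuation}\s]))*(?:\p{Sentence_Terminal}+['"`\p{Close_Punctuation}\p{Final_Punctuation}]*|$)/guy;

const input = "This is the first sentence! This is the second sentence with MyEnum.Value1 where I talk about it!";

// match() returns null on no match; matches keep leading whitespace, so trim.
const sentences = (input.match(sentenceRe) || []).map((s) => s.trim());
console.log(sentences);
// The dot in MyEnum.Value1 is followed by a letter, so it does not end a sentence.
```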

Here's a link to the regex on Regex101.com.

score 6 · Answer 4 · edited May 23 '17 at 11:46

Try this instead:-

sentences = text.split(/[\\.!\?]/);

? is a special character in regular expressions, so it needs to be escaped.

Sorry, I misread your question. If you want to keep the delimiters, you need to use match rather than split; see this question.

Community

answered Aug 01 '12 at 14:38

rgvcorley

Just a small note: special characters like `?` don't need to be escaped inside a character class (the square brackets). – May 06 '16 at 16:58

Mia Chen · Answer 5 · 2019-04-07T21:59:46.060

4

A slight improvement on mircealungu's answer:

string.match(/[^.?!]+[.!?]+[\])'"`’”]*/g);

There's no need for the opening parenthesis at the beginning.
Punctuation like '...', '!!!', '!?' etc. are included inside sentences.
Any number of closing square brackets and closing parentheses are included. [Edit: different closing quotation marks added]
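
A short sketch of the three points above in action (the sample string is made up for this example):

```javascript
// The pattern from this answer.
const closers = /[^.?!]+[.!?]+[\])'"`’”]*/g;

// Sample invented for illustration: a closing quote, a closing parenthesis,
// mixed punctuation ("...?"), and a closing square bracket.
const sample = 'He said "Stop!" (Really.) Wait...? [Yes!!!]';

const parts = sample.match(closers);
console.log(parts);
```

Each match keeps the closing quote or bracket that follows its punctuation, and a run such as `...?` stays in one piece.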

edited Apr 07 '19 at 21:59

answered Apr 07 '19 at 04:38

Is `...?` supported? – Abandoned Cart Feb 23 '20 at 11:39

score 4 · Answer 6 · answered Jun 22 '20 at 18:13

Improving on Mia's answer, here is a version that also includes a final sentence with no ending punctuation:

string.match(/[^.?!]+[.!?]+[\])'"`’”]*|.+/g)
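
A minimal sketch of what the extra `|.+` alternative does (sample strings made up for this example):

```javascript
// The pattern from this answer.
const re = /[^.?!]+[.!?]+[\])'"`’”]*|.+/g;

// The unterminated last sentence is picked up by the .+ alternative.
const parts = "Phrase 1. Phrase 2. Phrase 3".match(re);
console.log(parts);

// Caveat: without the s flag, . does not match newlines, so an unterminated
// fragment that spans several lines comes back one line at a time.
const multiline = "One. two\nthree".match(re);
console.log(multiline);
```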

lonemc

