35

I am trying to use JavaScript's `split` to get the sentences out of a string while keeping the delimiters (e.g. `.`, `!`, `?`).

So far I have

sentences = text.split(/[\\.!?]/);

which works but does not include the ending punctuation for each sentence (.!?).

Does anyone know of a way to do this?

Ionică Bizău
daktau
  • 1
    `?` is also a special char in RegExp so you need to escape it – rgvcorley Aug 01 '12 at 14:37
  • 6
    Metacharacters like `.` and `?` lose their special meanings inside a character class. The correct way to match a dot (`.`), an exclamation point (`!`), or a question mark (`?`) is `[.!?]`. – Alan Moore May 12 '13 at 07:14
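The point made in the comments above can be checked directly. The following is a minimal sketch, with variable names chosen only for illustration, comparing the unescaped class, the redundantly escaped class, and the doubled-backslash class from the question:

```javascript
// Three spellings of the delimiter class discussed above.
const plain = /[.!?]/;      // no escapes: `.` and `?` are literal inside [...]
const escaped = /[\.!\?]/;  // redundant escapes, same set of characters
const doubled = /[\\.!?]/;  // `\\` adds a literal backslash to the set

// Record which sample characters each class accepts.
const samples = [".", "!", "?", "a", "\\"];
const table = samples.map((ch) => ({
  ch,
  plain: plain.test(ch),
  escaped: escaped.test(ch),
  doubled: doubled.test(ch),
}));
console.log(table);

// The extra backslash only matters when the text itself contains one.
const parts = "a\\b.c".split(doubled);
console.log(parts); // ["a", "b", "c"]
```

So `[.!?]` and `[\.!\?]` behave identically, while `[\\.!?]` additionally treats a backslash as a delimiter.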

6 Answers

71

You need to use `match`, not `split`.

Try this.

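// Match runs of non-terminator characters followed by one or more terminators.
var str = "I like turtles. Do you? Awesome! hahaha. lol!!! What's going on????";
var result = str.match( /[^\.!\?]+[\.!\?]+/g );

// Each match keeps its trailing punctuation (and any leading space).
var expect = ["I like turtles.", " Do you?", " Awesome!", " hahaha.", " lol!!!", " What's going on????"];
console.log( result.join(" ") === expect.join(" ") ); // true
console.log( result.length === 6 ); // true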
Larry Battle
  • 1
    You can use a split: `text.split(/\b(?![\?\.\!])/);` \b tells it to split on word boundaries, the nifty part is the negative look-ahead. – bavo Dec 06 '15 at 23:35
  • 2
    The regex is wrong. If I type: "Phrase 1. Phrase 2. Phrase 3", "Phrase 3" gets thrown away. – Patricio Córdova Feb 05 '17 at 00:57
  • 4
    Here's a variation that also works when the last sentence ends without punctuation: `var result = str.match(/([^\.!\?]+[\.!\?]+)|([^\.!\?]+$)/g);` – Aneon Aug 24 '17 at 11:32
  • Wow, this also catches ellipses. `var str = "I like turtles... Do you? Awesome! hahaha. lol!!! What's going on????";` – giorgio79 Nov 09 '17 at 11:38
  • 6
    this breaks when having floating point numbers: `Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum"` – androidu Oct 06 '18 at 12:09
  • It picks abbreviations such as P.S. and makes it P. S. Any clue how to manage abbreviations? – Vikas Roy Jan 22 '21 at 13:46
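The trailing-sentence problem raised in the comments above can be reproduced with a short sketch. The function name `splitSentences` is hypothetical, and the pattern is the commenter's variation written without the redundant escapes and capture groups; it still does not handle decimals or abbreviations such as "P.S.":

```javascript
// Original pattern: every sentence must end in at least one terminator.
const withoutFallback = /[^.!?]+[.!?]+/g;
// Variation: also accept a final run of text with no terminator.
const withFallback = /[^.!?]+[.!?]+|[^.!?]+$/g;

// Hypothetical helper: returns trimmed sentences, or [] when nothing matches.
function splitSentences(text, pattern = withFallback) {
  return (text.match(pattern) ?? []).map((s) => s.trim());
}

const text = "Phrase 1. Phrase 2. Phrase 3";
const dropped = splitSentences(text, withoutFallback);
const kept = splitSentences(text);

console.log(dropped); // ["Phrase 1.", "Phrase 2."]
console.log(kept);    // ["Phrase 1.", "Phrase 2.", "Phrase 3"]
```

The `?? []` guard is there because `match` with the `g` flag returns `null`, not an empty array, when there is no match at all.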
12

The following is a small addition to Larry's answer which will also match parenthetical sentences:

text.match(/\(?[^\.\?\!]+[\.!\?]\)?/g);

applied on:

text = "If he's restin', I'll wake him up! (Shouts at the cage.) " +
  "'Ello, Mister Polly Parrot! (Owner hits the cage.) There, he moved!!!";

giveth:

["If he's restin', I'll wake him up!", " (Shouts at the cage.)", 
" 'Ello, Mister Polly Parrot!", " (Owner hits the cage.)", " There, he moved!"]
mircealungu
  • 1
    You missed the `+` after the punctuation character class `[.!?]`, so it will not capture the three exclamations after "he moved". – Mogsdad Sep 28 '15 at 23:56
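The effect of the missing `+` noted in the comment above can be seen in a small sketch. The variable names are illustrative only, and the redundant escapes are dropped from the character classes:

```javascript
const text = "(Owner hits the cage.) There, he moved!!!";

// As posted: exactly one terminator is consumed per sentence.
const single = /\(?[^.?!]+[.!?]\)?/g;
// With `+`: a run of terminators stays attached to its sentence.
const repeated = /\(?[^.?!]+[.!?]+\)?/g;

const withSingle = text.match(single);
const withRepeated = text.match(repeated);

console.log(withSingle);   // ["(Owner hits the cage.)", " There, he moved!"]
console.log(withRepeated); // ["(Owner hits the cage.)", " There, he moved!!!"]
```

With the single-terminator version, the leftover `!!` cannot start a new match because `[^.?!]+` needs at least one non-terminator character, so it is silently dropped.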
8

Improving on lonemc's answer (which improved on Mia Chen's answer, which improved on mircealungu's answer):

First, we can add the `u` flag at the end in order to match Unicode characters. After all, we probably want to be able to parse German sentences, French sentences, etc.

Second, instead of hard-coding the characters that should end a sentence, we can use "Sentence_Terminal", which is part of the Unicode standard.

Third, instead of hard-coding the characters that make up a closing bracket, we can use "Close_Punctuation".

Fourth, instead of hard-coding the characters that make up a closing quote, we can use "Final_Punctuation".

Fifth, we might not want to match things that look like enums. For example:

This is the first sentence! This is the second sentence with MyEnum.Value1 where I talk about it!

In order to do that, we can compose a match using a lookahead pattern:

string.match(/(?=[^])(?:\P{Sentence_Terminal}|\p{Sentence_Terminal}(?!['"`\p{Close_Punctuation}\p{Final_Punctuation}\s]))*(?:\p{Sentence_Terminal}+['"`\p{Close_Punctuation}\p{Final_Punctuation}]*|$)/guy);

Here's a link to the regex on Regex101.com.
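As a rough illustration of how the pattern above behaves, here is a minimal sketch that wraps it in a helper. The name `splitSentencesUnicode` and the sample strings are assumptions made for this example, and it needs a JavaScript engine that supports Unicode property escapes:

```javascript
// The pattern from the answer, unchanged. `g` collects every match and `y`
// forces each match to start exactly where the previous one ended.
const sentencePattern =
  /(?=[^])(?:\P{Sentence_Terminal}|\p{Sentence_Terminal}(?!['"`\p{Close_Punctuation}\p{Final_Punctuation}\s]))*(?:\p{Sentence_Terminal}+['"`\p{Close_Punctuation}\p{Final_Punctuation}]*|$)/guy;

// Hypothetical helper: returns trimmed sentences, or [] for empty input.
function splitSentencesUnicode(text) {
  return (text.match(sentencePattern) ?? []).map((s) => s.trim());
}

// A terminator followed by a letter ("MyEnum.Value1") does not end a sentence.
const enumSample =
  "This is the first sentence! This is the second sentence with MyEnum.Value1 where I talk about it!";
const enumSentences = splitSentencesUnicode(enumSample);

// A closing quote after the terminator stays with its sentence.
const quoteSample = 'He said "Stop!" Then he left.';
const quoteSentences = splitSentencesUnicode(quoteSample);

console.log(enumSentences);
console.log(quoteSentences);
```

One consequence of the lookahead is that a terminator only ends a sentence when whitespace, a quote, or a closing bracket follows it (or the text ends), so text written without spaces between sentences comes back as a single match.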

James
6

Try this instead:

sentences = text.split(/[\\.!\?]/);

`?` is a special character in regular expressions, so it needs to be escaped.

Sorry, I misread your question. If you want to keep the delimiters, you need to use `match`, not `split`; see this question.

Community
rgvcorley
  • 4
    Just a small note: special characters like `?` don't need to be escaped inside a character class (the square brackets). –  May 06 '16 at 16:58
4

A slight improvement on mircealungu's answer:

string.match(/[^.?!]+[.!?]+[\])'"`’”]*/g);
  • There's no need for the opening parenthesis at the beginning.
  • Punctuation like '...', '!!!', '!?' etc. are included inside sentences.
  • Any number of square close brackets and close parentheses are included. [Edit: different closing quotation marks added]
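The three points above can be illustrated with a short sketch. The sample text and variable names are made up for this example:

```javascript
// Sentence body, then one or more terminators, then any closing
// brackets or quotation marks that belong to the same sentence.
const pattern = /[^.?!]+[.!?]+[\])'"`’”]*/g;

const text = 'She asked, "Are you sure?" (He nodded.) Wait... what?!';
const sentences = (text.match(pattern) ?? []).map((s) => s.trim());

console.log(sentences);
// ['She asked, "Are you sure?"', "(He nodded.)", "Wait...", "what?!"]
```

The opening parenthesis needs no special handling because `(` is not a terminator, so `[^.?!]+` already consumes it as part of the sentence body.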
Mia Chen
4

Improving on Mia's answer, here is a version that also includes a final sentence with no ending punctuation:

string.match(/[^.?!]+[.!?]+[\])'"`’”]*|.+/g)
lonemc