28

I am currently working on an application that splits a long column of text into shorter ones. For that I split the entire text into sentences, but at the moment my regex splits numbers too.

What I do is this:

str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

The result is:

Array [
    "This is a long string with some numbers [125.",
    "000,55 and 140.",
    "000] and an end.",
    " This is another sentence."
]

The desired result would be:

Array [
    "This is a long string with some numbers [125.000,55 and 140.000] and an end.",
    "This is another sentence."
]

How do I have to change my regex to achieve this? Are there problems I should watch out for? Or would it be good enough to search for ". ", "? " and "! "?

Tobias Golbs
  • Can you change the string or is this not an option? – Beejee Sep 20 '13 at 10:38
  • Are you looking for a working regex that would get the desired result (or) you already know that and want suggestions on other potential problems with it? – Harry Sep 20 '13 at 10:39
  • @Beejee: I could manipulate the string. – Tobias Golbs Sep 20 '13 at 10:48
  • *'Or would it be good enough to search for `". "`, `"? "` and `"! "`?'* - No, because it doesn't allow for use of `". "` in an abbreviation: "Should we go to the F.B.I. or the Grammar Police?" – nnnnnn Aug 19 '16 at 07:29

8 Answers

43
str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
  'This is another sentence.' ]

Breakdown:

([.?!]) = Capture either . or ? or !

\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that normally follow a punctuation mark in English. Because this whitespace is part of the match but not of the capture group, it is dropped from the result.

(?=[A-Z]) = The previous tokens only match if the next character is within the range A-Z (capital A to capital Z). Most English language sentences start with a capital letter. None of the previous regexes take this into account.


The replace operation uses:

"$1|"

We used one "capturing group" ([.?!]) and we capture one of those characters, and replace it with $1 (the match) plus |. So if we captured ? then the replacement would be ?|.

Finally, we split the pipes | and get our result.


So, essentially, what we are saying is this:

1) Find punctuation marks (one of . or ? or !) and capture them

2) Punctuation marks can optionally include spaces after them.

3) After a punctuation mark, I expect a capital letter.

Unlike the previous regular expressions provided, this would properly match the English language grammar.

From there:

4) We replace the captured punctuation marks by appending a pipe |

5) We split the pipes to create an array of sentences.
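
The five steps above can be put together as a small helper. This is only a sketch: the name `splitOnCapital` is made up for illustration, and it assumes the text never contains a literal `|`.

```javascript
// Hypothetical helper wrapping the regex from this answer.
// Assumption: the input never contains a literal "|" character.
function splitOnCapital(text) {
  return text
    // Steps 1-4: punctuation plus optional whitespace in front of a
    // capital letter becomes punctuation plus "|" (whitespace is dropped).
    .replace(/([.?!])\s*(?=[A-Z])/g, "$1|")
    // Step 5: split on the pipes.
    .split("|");
}

const sample = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitOnCapital(sample));
// → two entries: "... and an end." and "This is another sentence."
```

The dots inside 125.000 and 140.000 are left alone because the character after them is a digit, not a capital letter.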

hrust
JSPP
  • This solution fails if a sentence starts with a number. – Tibos Sep 20 '13 at 10:52
  • You can modify it to this: /([.?!])\x20{1,2}(?=[A-Z\d])/. However, this would expect that A) decimal numbers do not have spaces after them, and B) there is either one or two space characters following a punctuation mark. This would conform with English grammar. If you cannot accept condition A, there would be an ambiguity in the grammar you are attempting to parse. – JSPP Sep 20 '13 at 10:58
  • More on grammatical ambiguity in computer science: http://en.wikipedia.org/wiki/Ambiguous_grammar . Essentially, in your situation, numbers with a decimal separator and punctuation marks for new sentences need to be grammatically distinguishable. The revised regex I provided conforms with the English language grammar. – JSPP Sep 20 '13 at 11:03
  • I fail to see how ignoring condition A leads to an ambiguous grammar. The dot ambiguity can be solved (imperfectly, but still a very practical solution) with a couple of rules: 1) dot between two digits is a decimal separator; 2) dot between anything except two digits is a punctuation mark - sentence separator. – Tibos Sep 20 '13 at 11:09
  • What about "My daughter is 10. 10 more years from now, she will be 20." ? – JSPP Sep 20 '13 at 11:12
  • Anything includes spaces and end-of-string. If you feel the wording is ambigous, you can consider that condition 2) is: any dot that doesn't match condition 1 is a punctuation mark. The key difference in our approaches is that you are attempting to match the sentence separator dot in a inherently complex context, while i am matching the decimal separator in a simple context. – Tibos Sep 20 '13 at 11:12
  • Sorry, I intended to edit condition A to use the word "digits" rather than "numbers," but apparently I can't edit after 5 mins or so. Digits following a decimal point cannot have spaces before them as in the English grammar. If he permits that, then it becomes ambiguous as in the example sentence I provided. Your proposed conditions were already considered (and satisfied) in my revised regex, no? – JSPP Sep 20 '13 at 11:22
  • The revised regexp assumes spaces.So this comment would not be split properly. – Tibos Sep 20 '13 at 11:39
  • Yes, and this was a condition included with the revised regex. – JSPP Sep 20 '13 at 11:51
  • Tibos, your last comment on spacing also directly contradicts and negates your previous remarks on ambiguity. You're grasping at straws. – JSPP Sep 20 '13 at 12:26
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/37735/discussion-between-tibos-and-rogerp) – Tibos Sep 20 '13 at 12:31
  • You can't assume a full stop followed by a space is the end of a sentence, because what about abbreviations in the middle of sentences? "Should I go to the F.B.I. or the Grammar Police?" – nnnnnn Aug 19 '16 at 07:27
  • This is a nice generalized solution, but will fail if you write something like, "In the U.S. it can be seen." Which will split to, "In the U." , "S. it can be seen." To work around this, you'd want to change the \s* to \s{1} or \s{x,y} for a range of spaces between x and y. – blau Oct 06 '22 at 16:58
15
str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")

The RegExp (see on Debuggex):

  • (\.+|\:|\!|\?) = The sentence can end not only with ".", "!" or "?", but also with "..." or ":"
  • (\"*|\'*|\)*|}*|]*) = The closing punctuation can be followed by quotation marks or closing brackets that wrap the sentence
  • (\s|\n|\r|\r\n) = A sentence has to be followed by a space or an end of line
  • g = global
  • m = multiline

Remarks:

  • If you use (?=[A-Z]), the RegExp will not work correctly in some languages. E.g. "Ü", "Č" or "Á" will not be recognised.
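
One way around the limitation in the remark is a Unicode property escape. The sketch below is not part of this answer's regex: it takes the `(?=[A-Z])` lookahead mentioned above and swaps in `\p{Lu}` (any uppercase letter), which needs the `u` flag and an ES2018-capable engine. The function name is hypothetical.

```javascript
// Sketch: \p{Lu} matches an uppercase letter in any script,
// so "Ü", "Č" and "Á" count as sentence starts too.
// Assumption: the input never contains a literal "|" character.
function splitOnUppercase(text) {
  return text.replace(/([.?!])\s*(?=\p{Lu})/gu, "$1|").split("|");
}

console.log(splitOnUppercase("Das ist gut. Über uns. Čau!"));
```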
Antonín Slejška
7

You could exploit that the next sentence begins with an uppercase letter or a number.

.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)


It splits this text

This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. Sencenes beginning with numbers work. 10 people like that.

into the sentences:

This is a long string with some numbers [125.000,55 and 140.000] and an end.
This is another sentence.
Sencenes beginning with numbers work.
10 people like that.
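
In JavaScript the pattern can be applied with the `g` flag and `String.prototype.match`. This is a rough sketch with a hypothetical function name. Each match after the first starts with the space that separated the sentences, so the results are trimmed.

```javascript
// The pattern from this answer, with the global flag added.
const sentenceRe = /.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)/g;

function matchSentences(text) {
  // match() returns null when nothing matches, so fall back to [].
  return (text.match(sentenceRe) || []).map(s => s.trim());
}

console.log(matchSentences(
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. 10 people like that."
));
```

Text after the last punctuation mark is not matched by the pattern, so a trailing fragment without ".", "!" or "?" is silently dropped.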


Community
tessi
  • This is great, I just noticed it doesn't handle bad user input such as "Jim went to the store.Larry slept until 12.But Becky left for the weekend." But this is beyond the scope of the question. I just mention it for anyone like myself who might be looking for a quick regexp to handle this. – Quinxy von Besiex Oct 03 '16 at 00:00
  • This also does not handle ? or ! – AWhatley Aug 16 '20 at 13:25
5

Use a lookahead so that a dot is only replaced when it is followed by whitespace and a word character:

sentences = str.replace(/\.(?=\s+\w)/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

OUTPUT:

["This is a long string with some numbers [125.000,55 and 140.000] and an end.", " This is another sentence."]
anubhava
4

You're safer using lookahead to make sure what follows after the dot is not a digit.

var str ="This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."

var sentences = str.replace(/\.(?!\d)/g,'.|');
console.log(sentences);

If you want to be even safer you could check whether what comes before the dot is a digit as well. Lookbehind only arrived in JavaScript with ES2018, so to stay compatible with older engines you can capture the previous character and reuse it in the replacement string.

var str ="This is another sentence.1 is a good number"

var sentences = str.replace(/\.(?!\d)|([^\d])\.(?=\d)/g,'$1.|');
console.log(sentences);

An even simpler solution is to escape the dots inside numbers (replace them with $$$$ for example), do the split and afterwards unescape the dots.
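
A minimal sketch of that escape, split, unescape idea follows. The placeholder characters and the function name are assumptions made for illustration: control characters are used instead of `$$$$` so that the placeholder cannot clash with the special meaning of `$` in replacement strings, and the input is assumed not to contain them.

```javascript
// Hypothetical placeholders, assumed never to occur in the input.
const DOT = "\u0000";   // stands in for a dot between two digits
const BREAK = "\u0001"; // marks a sentence boundary

function splitProtectingNumbers(text) {
  return text
    // 1) Escape: hide dots that sit between two digits, e.g. "125.000".
    .replace(/(\d)\.(?=\d)/g, "$1" + DOT)
    // 2) Split: mark the end of each sentence, then cut at the marks.
    .replace(/([.?!]+)\s*/g, "$1" + BREAK)
    .split(BREAK)
    // 3) Unescape: put the hidden dots back and tidy up.
    .map(s => s.split(DOT).join(".").trim())
    .filter(s => s.length > 0);
}

console.log(splitProtectingNumbers(
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
));
```

Because only dots between two digits are protected, "sentence.1 is a good number" is still split after "sentence.".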

Tibos
3

You forgot to put '\s' in your regexp.

Try this one:

var str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
var sentences = str.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");
console.log(sentences[0]);
console.log(sentences[1]);

http://jsfiddle.net/hrRrW/

yilmazburk
3

I would just change the strings and put something between each sentence. You told me you have the right to change them so it will be easier to do it this way.

\r\n

By doing this you have a string to search for and you won't need to use these complex regex.
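
As a rough sketch of the delimiter approach, assuming the text is stored with one sentence per line (the function name and the sample string are made up for illustration):

```javascript
// Assumption: sentences were separated by "\r\n" (or a bare "\n")
// when the text was written, so no sentence detection is needed.
function splitOnLineBreaks(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.length > 0);
}

const stored = "Prices rose to 125.000,55 today.\r\nThis is another sentence.";
console.log(splitOnLineBreaks(stored));
```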

If you want to do it the harder way, I would use a regex that looks for ".", "?" or "!" followed by a capital letter, like Tessi showed you.

Beejee
0

@Roger Poon's and @Antonín Slejška's answers work well.

It would be better to also trim each sentence and filter out empty strings:

const splitBySentence = (str) => {
  return str.replace(/([.?!])(\s)*(?=[A-Z])/g, "$1|")
    .split("|")
    // Trim first so that whitespace-only pieces are filtered out as well.
    .map(sentence => sentence.trim())
    .filter(sentence => !!sentence);
}

const content = `
The Times has identified the following reporting anomalies or methodology changes in the data for New York:

May 6: New York State added many deaths from unspecified days after reconciling data from nursing homes and other care facilities.

June 30: New York City released deaths from earlier periods but did not specify when they were from.

Aug. 6: Our database changed to record deaths by New York City residents instead of deaths that took place in New York City.

Aug. 20: New York City removed four previously reported deaths after reviewing records. The state reported four new deaths in other counties.(extracted from NY Times)
`;

console.log(splitBySentence(content));
glinda93