0

I'm having a little difficulty with a regex for JavaScript.

Here's my fiddle: http://jsfiddle.net/6yhwzap0/

The function I have created is:

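var splitSentences = function(text) {
    // match() returns null when nothing matches, so fall back to an empty array.
    var messy = text.match(/\(?[^\.\?\!]+[\.!\?]\)?/g) || [];
    var clean = [];
    for(var i = 0; i < messy.length; i++) {
        var s = messy[i];
        var sTrimmed = s.trim();
        if(sTrimmed.length > 0) {
            // A fragment containing a space is treated as a sentence of its own.
            // The same applies when there is no previous sentence to attach to.
            if(sTrimmed.indexOf(' ') >= 0 || clean.length === 0) {
                clean.push(sTrimmed);
            } else {
                // A fragment without spaces is glued onto the previous sentence.
                var d = clean[clean.length - 1];
                d = d + s;

                // Guard against reading past the end of the array.
                var e = messy[i + 1];
                if(e !== undefined && e.trim().indexOf(' ') >= 0) {
                    d = d + e;
                    i++;
                }
                clean[clean.length - 1] = d;
            }
        }
    }
    return clean;
};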

I get really good results with text.match(/\(?[^\.\?\!]+[\.!\?]\)?/g). My big issue is that if a string has a closing quote after the period, the quote is added to the next sentence.

So for example the following:

"Hello friend. My name is Mud." Said Mud.

Should be split into the following array:

['"Hello friend.', 'My name is Mud."', 'Said Mud.']

But instead it is the following:

['"Hello friend.', 'My name is Mud.', '" Said Mud.']

(See the quote in the 'Said Mud' string)

Can anyone help me with this OR point me to a good JavaScript library that can split text into Paragraphs, Sentences and Words? I found blast.js but I am using Angular.js and it did not integrate well at all.

RachelD
  • Also see http://stackoverflow.com/questions/11761563/javascript-regexp-for-splitting-text-into-sentences-and-keeping-the-delimiter. –  Dec 24 '14 at 07:36

3 Answers

6

I suggest using string.match instead of string.split.

\S.*?\."?(?=\s|$)

DEMO

> var s = '"Hello friend. My name is Mud." Said Mud.'
undefined
> s.match(/\S.*?\."?(?=\s|$)/g)
[ '"Hello friend.',
  'My name is Mud."',
  'Said Mud.' ]
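
The pattern above only treats a period as a sentence terminator. A possible extension, shown here as a sketch rather than as part of the original answer, also accepts ! and ? (including runs such as !!!) and a few closing quote and bracket characters. The function name splitSentencesLoose is hypothetical. Abbreviations such as "U.S." will still be split, and trailing text without a terminator is dropped.

```javascript
// Hypothetical helper, not from the original answer.
function splitSentencesLoose(text) {
  // \S          start at the first non-whitespace character
  // [\s\S]*?    lazily consume anything, including newlines
  // [.!?]+      one or more sentence terminators
  // ["')\]]*    optional closing quotes or brackets
  // (?=\s|$)    must be followed by whitespace or the end of the input
  var pattern = /\S[\s\S]*?[.!?]+["')\]]*(?=\s|$)/g;
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(pattern) || [];
}

console.log(splitSentencesLoose('"Hello friend. My name is Mud." Said Mud.'));
// [ '"Hello friend.', 'My name is Mud."', 'Said Mud.' ]
```

Because the lookahead still requires whitespace or the end of the input after the terminator, punctuation inside a URL or a decimal number does not end a sentence.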
Avinash Raj
  • This worked for me. In my regex expression I also included URLs and this regex was able to not be tricked by the punctuation inside the URL and think it was end of a sentence. Thanks – Shane Sepac Feb 26 '21 at 21:31
2

Below is an example of how to solve your immediate problem. However, evaluating the characteristics of a sentence is obviously not the same as parsing text elements.

Regular expressions are best used in deterministic algorithms. Sentences are for the most part non-deterministic and subject to interpretation. You need a natural language processing library for that type of use case.

Natural is an NLP library for Node.js that might be a good solution for your use case. However, I haven't personally used it. YMMV.

Alchemy is another option, with a full-featured NLP API available as a REST web service.

var text = "If he's restin', I'll wake him up! (Shouts at the cage.) 'Ello, Mister Polly Parrot! (Owner hits the cage.) There, he moved!!!\r\n\r\nNorth Korea is accusing the U.S. government of being behind the making of the movie \"The Interview.\"\r\n\r\nAnd, in a dispatch on state media, the totalitarian regime warns the United States that U.S. \"citadels\" will be attacked, dwarfing the attack on Sony that led to the cancellation of the film's release.\r\n\r\nWhile steadfastly denying involvement in the hack, North Korea accused U.S. President Barack Obama of calling for \"symmetric counteraction.\"\r\n\r\n\"The DPRK has already launched the toughest counteraction. Nothing is more serious miscalculation than guessing that just a single movie production company is the target of this counteraction. Our target is all the citadels of the U.S. imperialists who earned the bitterest grudge of all Koreans,\" a report on state-run KCNA read.";

var splitSentences = function() {

  var pattern = /(.+?([A-Z].)[\.|\?](?:['")\\\s]?)+?\s?)/igm, match;
  var ol = document.getElementById( "result" );
  while( ( match = pattern.exec( text )) != null ) {
    if( match.index === pattern.lastIndex ) {
      pattern.lastIndex++;
    }
    var li = document.createElement( "li" );
    li.appendChild( document.createTextNode( match[0] ) );
    ol.appendChild( li );
    console.log( match[0] );
  }

}();
<ol id="result">
</ol>

The Expression

     /(.+?([A-Z].)[\.|\?](?:['")\\\s]?)+?\s?)/igm

    1st Capturing group (.+?([A-Z].)[\.|\?](?:['")\\\s]?)+?\s?)
        .+? matches any character (except newline)
            Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
        2nd Capturing group ([A-Z].)
            [A-Z] match a single character present in the list below
                A-Z a single character in the range between A and Z
            . matches any character (except newline)
        [\.|\?] match a single character present in the list below
            \. matches the character . literally
            | matches the character | literally
            \? matches the character ? literally
        (?:['")\\\s]?)+? Non-capturing group
            Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
            ['")\\\s]? match a single character present in the list below, between zero and one time
                '") a single character in the list '") literally
                \\ matches the character \ literally
                \s match any white space character [\r\n\t\f ]
        \s? match any white space character [\r\n\t\f ]
            Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)  
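
One detail of the expression is easy to check in isolation. Inside a character class, | is a literal character rather than alternation, so [\.|\?] accepts ".", "|" and "?" but not "!". The short check below is a sketch; the variable names original and simplified are illustrative only. The missing "!" is at least consistent with the report in the comments that "There, he moved!!!" is skipped.

```javascript
// The terminator class exactly as written in the expression above.
var original = /[\.|\?]/;
// An equivalent class without the stray pipe (escapes are optional inside []).
var simplified = /[.?]/;

console.log(original.test('|'));   // true: the pipe is matched literally
console.log(simplified.test('|')); // false
console.log(original.test('!'));   // false: "!" never terminates a match
console.log(simplified.test('.'), simplified.test('?')); // true true
```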
Edward J Beckett
  • It took me a while but I realized there is a bug with the above. Somehow it is skipping the phrase "There, he moved!!!" in the first paragraph. – RachelD Dec 27 '14 at 01:15
  • This is a situation that's very difficult to resolve. Exclamation points and ellipses are the only grammatical marks that can 'optionally' end a sentence. – Edward J Beckett Dec 27 '14 at 02:42
  • I know, I've been trying to see if I could get it. It doesn't seem like this can be split by just a RegEx. I'm looking for a library. If you know of ANY it would be a huge help. Thanks so much for the very complete answer though. – RachelD Dec 27 '14 at 15:43
  • This does not work for me. – Shane Sepac Feb 26 '21 at 21:30
1

Regexp is a very blunt instrument, and is not the right way to do natural language processing, which is what this is. You'll need to find a library to do this, or write your own.

In addition to the problem you discovered with quote marks, of course you have to handle abbreviations. In addition, if your app will ever be used with other languages, you'll have to implement logic for different ways to separate sentences in each language. As I said, find a library.
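
To illustrate the abbreviation problem, here is a sketch of what happens with a naive split and one way to patch it. The ABBREVIATIONS list and the mergeAbbreviations helper are hypothetical, and the list is deliberately tiny. The patch also shows how quickly special cases pile up: a sentence that genuinely ends in "U.S." would now be glued to the one after it.

```javascript
// Hypothetical abbreviation list; a real one would be far longer.
var ABBREVIATIONS = ['U.S.', 'Mr.', 'Dr.', 'e.g.'];

// Re-joins fragments that were split right after a known abbreviation.
function mergeAbbreviations(fragments) {
  var merged = [];
  var joinNext = false;
  fragments.forEach(function (fragment) {
    if (joinNext && merged.length > 0) {
      merged[merged.length - 1] += ' ' + fragment;
    } else {
      merged.push(fragment);
    }
    // If this fragment ends with an abbreviation, the next one belongs to it.
    var lastWord = fragment.split(/\s+/).pop();
    joinNext = ABBREVIATIONS.indexOf(lastWord) !== -1;
  });
  return merged;
}

var naive = 'He praised the U.S. government. Then he left.'
  .match(/\S.*?\."?(?=\s|$)/g);
console.log(naive);
// [ 'He praised the U.S.', 'government.', 'Then he left.' ]
console.log(mergeAbbreviations(naive));
// [ 'He praised the U.S. government.', 'Then he left.' ]
```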

You may be able to find a regexp that kinda sorta works, but sooner or later another edge case will come up, such as nested quotes:

"When Sally said 'Regexps are not good for NLP. Write a parser', I agreed", said Bob.

Then you will spend the rest of your life fixing up your giant regexp or, more likely, run into a brick wall where you simply can't do what you want to.
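
Nobody in this thread suggested it, but newer JavaScript runtimes ship a built-in sentence segmenter, Intl.Segmenter, which may remove the need for a hand-written regexp in some cases. The sketch below assumes the runtime provides it. The function name segmentSentences and the default locale 'en' are illustrative choices.

```javascript
// Sketch only: relies on Intl.Segmenter being present in the runtime.
function segmentSentences(text, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    throw new Error('Intl.Segmenter is not available in this runtime');
  }
  var segmenter = new Intl.Segmenter(locale || 'en', { granularity: 'sentence' });
  var sentences = [];
  // segment() returns an iterable of objects with a `segment` string property.
  for (var part of segmenter.segment(text)) {
    var trimmed = part.segment.trim();
    if (trimmed.length > 0) {
      sentences.push(trimmed);
    }
  }
  return sentences;
}
```

Calling segmentSentences('"Hello friend. My name is Mud." Said Mud.') returns an array of trimmed sentences. The exact boundaries depend on the locale data bundled with the runtime, so no expected output is shown here.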

  • 1
    True... RegEx isn't good for this.. every change in text would require refactoring the tokens... fun challenge though ;) – Edward J Beckett Dec 24 '14 at 07:39
  • You are correct, a library would be the best solution for this. And they must exist (surely someone else has already had this problem). But do you know of any JavaScript text libraries? My efforts on Google have only found Blast.js and it did not integrate well with Angular.js (as I mentioned). I would love a library suggestion. – RachelD Dec 24 '14 at 14:32