1

Given a string such :

 var str = "thisisinsane";

assisted by a list of words from a dictionary such:

 var dic = [ "insane", "i", "is", "sin", "in", "this", "totally" ];

How to split str into words?

For this string, there are 3 words to identify. But we need avoid the pitfalls. To avoid them most of the time, I know we can attack the sentence by the left, and try to find the longest word prossible. When found, we could attack the rest of the string, etc.

Below : the input, the possible pitfalls, and the wanted output in bottom right.

                      thisisinsane
                          |
                          |
                     (this)isinsane
                     /            \
                    /              \
          (this,i)sinsane         (this,is)insane
              /                     /           \
             /                     /             \
  (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                /                   <BEST IS>
                               /                    <THIS ONE>
                       (this,is,in,sane)

At the end, we want to get :

 var splited = ["this", "is", "insane"];
Hugolpz
  • 17,296
  • 26
  • 100
  • 187
  • Maybe dig around http://stackoverflow.com/questions/5834371/ – Hugolpz Nov 22 '13 at 23:53
  • Just curious, what are you doing this for? spellchecker? – stevebot Nov 22 '13 at 23:56
  • @stevebot: for chinese text-segmentation. I know the direction but I have no idea how we can attack "the longuest string". (beginner in JS) – Hugolpz Nov 22 '13 at 23:58
  • @JPod: I want to find the longest words possible in the rest of `str` without re-using the letters from previously found words. + I expanded a bit the `var dic` to illustrate better the pitfalls we face. – Hugolpz Nov 23 '13 at 00:06
  • Man I don't know how to post an answer for this and I am not sure you will get one. There are a lot of text segmentation libraries out there for other languages that you can use as a starting point. – stevebot Nov 23 '13 at 00:17

1 Answers1

1

This is a quick implementation that will do the search from left to right and match longest words in the dictionary first (jsfiddle). However, I'm not so sure it's very clever to implement this on your own as it sounds like a complicated field and even without any knowledge on the subject I can tell that this algorithm is flawed to begin with. You might be better off looking out for existing libraries if there are any.

Needless to say, this was just typed together quickly. It is not optimized for performance in any way (it uses recursion, which is really not at all necessary), and also not extensively tested. It works on your example data, though, and on some variations I tested. I like to leave some of the work to the OP in case I give full code examples, so feel free to improve this if you want to use it.

var splitByDictionary = function (input, dictionary) {
    "use strict";

    // make sure we're going to look for longest-possible matches first
    dictionary.sort( function (a, b) {
        return b.length - a.length;
    } );

    var foundWords = [],
        remaining = input;

    var result = (function match () {
        if( remaining.length === 0 ) {
            return true;
        }

        for( var i = 0; i < dictionary.length; i++ ) {
            if( remaining.substr( 0, dictionary[i].length ) === dictionary[i] ) {
                foundWords.push( dictionary[i] );
                remaining = remaining.substr( dictionary[i].length );

                return match();
            }
        }

        return false;
    })();

    return result ? foundWords : null;
};

var splitted = splitByDictionary( "thisisinsane", ["insane", "i", "is", "sin", "in", "this", "totally"] );
console.log( splitted ); // ["this", "is", "insane"]
Ingo Bürk
  • 19,263
  • 6
  • 66
  • 100