-4

I have been trying to build the algorithm but I'm unable to find the solution. My sentence doesn't have any delimiters to split given sentence into multiple strings.

Avinash Raj
  • 172,303
  • 28
  • 230
  • 274
  • i think you need to use nltk parser. – Avinash Raj Apr 14 '15 at 02:16
  • 1
    well google can do it in google search , (if u go and type that word it breaks up as gives list of possible with space) so it is possible , BUT **question is what level you want to try at for that u need to post what u have tired** – Srinath Ganesh Apr 14 '15 at 02:25
  • 1
    possible duplicate of [How can I split multiple joined words?](http://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words) – Abhishek Apr 14 '15 at 03:21

4 Answers4

1
import java.util.TreeSet;

public class YourHomework {

    // Dictionary
    private static final TreeSet<String> dic = new TreeSet<String>() {
        {
            add("this");
            add("is");
            add("a");
            add("sentence");
        }
    };

    private static String convert(String input) {
        StringBuilder output = new StringBuilder();
        StringBuilder word = new StringBuilder();

        for (char c : input.toCharArray()) {
            // Each new char is added to the current word
            word.append(c);

            // If we have a match with the current word
            if (dic.contains(word.toString())) { // We will append it to our output
                if (output.length() > 0) { // But first we add a space if it's not the first word
                    output.append(' ');
                }

                // We add the word to the output sentence
                output.append(word);
                word.setLength(0);
            }
        }

        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("thisisasentence"));
    }

}
Florent
  • 1,311
  • 1
  • 14
  • 15
0

I'm guessing this is Computer Science homework (and not specifically natural language processing). I'll share an approach rather than provide a solution:

  1. Start with your input sentence
  2. In a loop, take N characters from the start of your sentence (start with 1, then 2, etc...)
  3. Check the N characters you just took off the start of your sentence against your dictionary of valid words.
  4. If you find a match, output that word and remove N characters from the input sentence. Resume at step 2, unless you have reached the end of the sentence (in which case you are done).
  5. If you do not find a match, you are done. The next word did not match anything in your dictionary.

Note that casing is important with a dictionary. If everything is lower case you're fine, but if ThisisJohn is valid input, you need to give some thought to how to handle casing.

If you are attempting general case natural language processing, the solution is significantly more complex.

Eric J.
  • 147,927
  • 63
  • 340
  • 553
  • 5: you're not done. You need to backtrack in case your earlier word boundary guesses were incorrect. – Thilo Apr 14 '15 at 02:22
  • @Thilo: Explain that? How do you miss a word boundary when testing every possible substring? (Unless there is a word-not-in-the-dictionary followed by some that are... but the behavior is not defined by the question). – Eric J. Apr 14 '15 at 02:26
  • 1
    Even if casing isn't important, how is the program supposed to know to parse "this is" or "this i"? Even in the OP's question, how are you supposed to determine to parse "this is a" or "this is as"? Because both examples have valid words in a dictionary. This question is more like an autocorrect with built in grammar correction. Much like smart phones, or word processors and even those are not entirely accurate. – Shar1er80 Apr 14 '15 at 02:27
  • @Shar1er80: Sure, but the problem reads much more like CS 101 and much less like NLP. – Eric J. Apr 14 '15 at 02:28
0

Here's the solution I improvised in Python:

dictionary = set([
    'this',
    'is',
    'a',
    'i',
    'sentence',
])
match_bkt = []
# ip_str = 'sentence'
# ip_str = 'thisisnotasentence'
ip_str = 'thisisasentence'
# ip_str = 'isa'


def sentence_chk(ip_list):
    for i in xrange(1, len(ip_list) + 1):
        sub_str = ''.join(ip_list[:i])
        if sub_str in dictionary:
            # push to match_bkt
            match_bkt.append(sub_str)
            if len(ip_list[i:]) > 0:
                if sentence_chk(ip_list[i:]) is not True:
                    # pop match_bkt & backtrack
                    match_bkt.pop()
                    continue
                else:
                    sentence_chk(ip_list[i:])
        if len(''.join(match_bkt)) == len([o for o in ip_str]):
            print ip_str, '-- is a sentence'
            exit()

ip_list = [i for i in ip_str]
if sentence_chk(ip_list) is None:
    print ip_str, '-- is not a sentence'


#I'm sure this can be further optimized

Gist Link: https://gist.github.com/svalleru/bbaae89dbee974b2ba6b

Shan Valleru
  • 3,093
  • 1
  • 22
  • 21
0

In brief, to split a string, try getting first part from input using different length, and if it is a valid word in dictionary, split the remaining part and construct the result strings accordingly.

Simple recursive thought (psuedo code):

List<String> split(String sentence) {
    result = new List<String>();
    if (sentence is empty) {
        return result;
    }
    for (i = 1; i < sentence.length; i++) {
        firstPart = sentence.substring(i)
        if ( s is in dictionary) {
            remaining = split(sentence.substring(i+1 .. end))
            foreach s in remaining {
                result.add(firstPart  + " " + s)
            }
        }
    }
    return result;
}
Adrian Shum
  • 38,812
  • 10
  • 83
  • 131