11

I am trying to implement a program that will take a user's input, split that string into tokens, and then search a dictionary for the words in that string. My goal is for every token in the parsed string to be an English word.

For Example:

Input:
       aman

Split Method:
      a man
      a m an
      a m a n
      am an
      am a n
      ama n

Desired Output:
      a man

I currently have this code which does everything up until the desired output part:

import java.util.Scanner;
import java.io.*;

public class Words {

    public static String[] dic = new String[80368];

    public static void split(String head, String in) {

        // head + " " + in is a segmentation 
        String segment = head + " " + in;

        // count number of dictionary words
        int count = 0;
        Scanner phraseScan = new Scanner(segment);
        while (phraseScan.hasNext()) {
            String word = phraseScan.next();
            for (int i=0; i<dic.length; i++) {
                if (word.equalsIgnoreCase(dic[i])) count++;
            }
        }

        System.out.println(segment + "\t" + count + " English words");

        // recursive calls
        for (int i=1; i<in.length(); i++) {
            split(head+" "+in.substring(0,i), in.substring(i,in.length()));
        }   
    }

    public static void main (String[] args) throws IOException {
        Scanner scan = new Scanner(System.in);
        System.out.print("Enter a string: ");
        String input = scan.next();
        System.out.println();

        Scanner filescan = new Scanner(new File("src:\\dictionary.txt"));
        int wc = 0;
        while (filescan.hasNext()) {
            dic[wc] = filescan.nextLine();
            wc++;
        }

        System.out.println(wc + " words stored");

        split("", input);

    }
}

I know there are better ways to store the dictionary (such as a binary search tree or a hash table), but I don't know how to implement those anyway.

I am stuck on how to implement a method that would check the split string to see if every segment was a word in the dictionary.

Any help would be great. Thank you.

Brendan Lesniak

3 Answers

19

Splitting the input string every possible way is not going to finish in a reasonable amount of time if you want to support 20 or more characters, because a string of n characters has 2^(n-1) possible splits. Here's a more efficient approach, with comments inline:

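import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.Stack;

// both methods below go inside a class, e.g. the Words class from the question
public static void main(String[] args) throws IOException {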
    // load the dictionary into a set for fast lookups
    Set<String> dictionary = new HashSet<String>();
    Scanner filescan = new Scanner(new File("dictionary.txt"));
    while (filescan.hasNext()) {
        dictionary.add(filescan.nextLine().toLowerCase());
    }

    // scan for input
    Scanner scan = new Scanner(System.in);
    System.out.print("Enter a string: ");
    String input = scan.next().toLowerCase();
    System.out.println();

    // place to store list of results, each result is a list of strings
    List<List<String>> results = new ArrayList<>();

    long time = System.currentTimeMillis();

    // start the search, pass empty stack to represent words found so far
    search(input, dictionary, new Stack<String>(), results);

    time = System.currentTimeMillis() - time;

    // list the results found
    for (List<String> result : results) {
        for (String word : result) {
            System.out.print(word + " ");
        }
        System.out.println("(" + result.size() + " words)");
    }
    System.out.println();
    System.out.println("Took " + time + "ms");
}

public static void search(String input, Set<String> dictionary,
        Stack<String> words, List<List<String>> results) {

    for (int i = 0; i < input.length(); i++) {
        // take the first i + 1 characters of the input and see if they form a word
        String substring = input.substring(0, i + 1);

        if (dictionary.contains(substring)) {
            // the beginning of the input matches a word, store on stack
            words.push(substring);

            if (i == input.length() - 1) {
                // there's no input left, copy the words stack to results
                results.add(new ArrayList<String>(words));
            } else {
                // there's more input left, search the remaining part
                search(input.substring(i + 1), dictionary, words, results);
            }

            // pop the matched word back off so we can move onto the next i
            words.pop();
        }
    }
}

Example output:

Enter a string: aman

a man (2 words)
am an (2 words)

Took 0ms

Here's a much longer input:

Enter a string: thequickbrownfoxjumpedoverthelazydog

the quick brown fox jump ed over the lazy dog (10 words)
the quick brown fox jump ed overt he lazy dog (10 words)
the quick brown fox jumped over the lazy dog (9 words)
the quick brown fox jumped overt he lazy dog (9 words)

Took 1ms
erickson
WhiteFang34
  • Another way would be to **store the words in a database**. This will increase performance when working with huge numbers of words (> 4 million). – Alba Mendez May 15 '11 at 11:06
  • @jmendeth: sure, a database could help if the dictionary was large enough and there wasn't enough memory available. Most dictionaries aren't that large however. The larger one I tested with has over 400k words and requires 38MB. The OP doesn't need a database since his dictionary has 80k words and only consumes around 7MB. For a huge number of words I'd probably try using a different data structure like a trie before going to a database. A database would work fine though, in the 36 character example input I gave there are only 335 lookups. – WhiteFang34 May 15 '11 at 11:41
  • You're right, but sometimes (not in this case) dictionaries of other languages/characters can be about 10 Million words. – Alba Mendez May 15 '11 at 11:45
  • Is there any way to implement a binary search tree instead of a HashSet? Thank you for your answer. – Brendan Lesniak May 15 '11 at 15:20
  • Binary search tree would give you O(lg(n)) search time instead of O(1), so that's not so hot. A trie on letters though would make it possible to implement startsWith or similar. With the current implementation, if you are given a three petabyte string that happens not to start with a word, "xzaszssxaa...", then you'll scan the entire string, repeatedly looking for longer and longer substrings in the dictionary, instead of quickly figuring out that it's not there. With a trie implementation, you'd stop early. – Gregory Marton Jun 15 '11 at 17:25
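The trie idea from the last comment could look roughly like the sketch below. It is not code from any of the answers: the class name `TrieSplitter` and its methods are made up for illustration, and the four-word dictionary in `main` is a stand-in for a real word list. The point it tries to show is the early exit: as soon as no dictionary word starts with the characters read so far, the search stops instead of testing ever longer substrings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the original answers.
class TrieSplitter {

    // One trie node: children keyed by letter, plus an end-of-word marker.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Adds one dictionary word to the trie.
    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Returns every way to split the input into dictionary words.
    public List<List<String>> split(String input) {
        List<List<String>> results = new ArrayList<>();
        search(input, 0, new ArrayList<>(), results);
        return results;
    }

    private void search(String input, int start, List<String> words,
            List<List<String>> results) {
        if (start == input.length()) {
            // nothing left to match, so the words collected so far are a full split
            results.add(new ArrayList<>(words));
            return;
        }
        Node node = root;
        for (int i = start; i < input.length(); i++) {
            node = node.children.get(input.charAt(i));
            if (node == null) {
                return; // no dictionary word has this prefix: stop early
            }
            if (node.isWord) {
                words.add(input.substring(start, i + 1));
                search(input, i + 1, words, results);
                words.remove(words.size() - 1); // backtrack
            }
        }
    }

    public static void main(String[] args) {
        TrieSplitter trie = new TrieSplitter();
        for (String w : new String[] {"a", "am", "an", "man"}) {
            trie.add(w);
        }
        System.out.println(trie.split("aman"));       // [[a, man], [am, an]]
        System.out.println(trie.split("xzaszssxaa")); // []
    }
}
```

With the "xzaszssxaa" input the search gives up after the first character, because no word in the toy dictionary starts with "x".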
1

If my answer seems silly, it's because you're really close and I'm not sure where you're stuck.

The simplest way, given your code above, would be to add a counter for the total number of words and compare it to the number of matched words:

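    int count = 0; int total = 0;
    Scanner phraseScan = new Scanner(segment);
    while (phraseScan.hasNext()) {
        total++;
        String word = phraseScan.next();
        for (int i=0; i<dic.length; i++) {
            if (word.equalsIgnoreCase(dic[i])) {
                count++;
                break; // count each token once, even if the dictionary has duplicates
            }
        }
    }
    // print the segmentation only if every token is a dictionary word
    if(total==count) System.out.println(segment);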

Implementing this as a hash-table might be better (it's faster, for sure), and it'd be really easy.

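// requires java.util.HashSet, java.util.Scanner and java.util.Set
Set<String> dict = new HashSet<String>();
dict.add("foo"); // add your data, stored in lower case

int count = 0; int total = 0;
Scanner phraseScan = new Scanner(segment);
while (phraseScan.hasNext()) {
    total++;
    String word = phraseScan.next();
    // lower-case the token so the lookup stays case-insensitive
    if (dict.contains(word.toLowerCase())) count++;
}
if (total == count) System.out.println(segment);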

There are other, better ways to do this. One is a trie (http://en.wikipedia.org/wiki/Trie), which is a bit slower for lookup but stores data more efficiently. If you have a large dictionary, you might not be able to fit it in memory, so you could use a database or a key-value store like BDB (http://en.wikipedia.org/wiki/Berkeley_DB).

dfb
0

package LinkedList;

import java.util.LinkedHashSet;

public class dictionaryCheck {

private static LinkedHashSet<String> set;

public boolean checkDictionary(String str, int length) {
    return check(str, 0, length);
}

// true if the characters of str from index start up to length
// can be split entirely into words from the set
private boolean check(String str, int start, int length) {
    if (start >= length) {
        return true; // the whole string has been consumed
    }
    for (String word : set) {
        int end = start + word.length();
        // try this word as the next token; backtrack if the rest cannot be split
        if (end <= length && str.startsWith(word, start) && check(str, end, length)) {
            return true;
        }
    }
    return false;
}

public static void main(String[] args) {
    // TODO Auto-generated method stub
    set = new LinkedHashSet<String>();
    set.add("Jose");
    set.add("Nithin");
    set.add("Joy");
    set.add("Justine");
    set.add("Jomin");
    set.add("Thomas");
    String str = "JoyJustine";
    int length = str.length();
    boolean c;

    dictionaryCheck obj = new dictionaryCheck();
    c = obj.checkDictionary(str, length);
    if (c) {
        System.out
                .println("String can be found out from those words in the Dictionary");
    } else {
        System.out.println("Not Possible");
    }

}

}

  • Simple and effective solution. Let me know if I missed something. Its time complexity is exponential, I guess. Polynomial time complexity can be achieved with a dynamic programming solution. – Justin Jose Sep 30 '15 at 21:08
  • While this code may solve the OP's problem, you should really add some explanation about what the code does, or how it does it. _Just Code_ answers are frowned upon. – BrokenBinary Sep 30 '15 at 21:19
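The dynamic programming approach mentioned in the first comment was not posted in the thread, so the following is only a sketch of one common way to do it. The class name `WordBreak` and the method `canSplit` are hypothetical, and the dictionary reuses the names from the answer above. It records, for each prefix length, whether that prefix can be split into dictionary words, so each position is solved once instead of being revisited by recursion.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class, shown only to illustrate the dynamic programming idea.
class WordBreak {

    // Returns true if str can be split entirely into words from dict.
    static boolean canSplit(String str, Set<String> dict) {
        int n = str.length();
        // reachable[i] is true when the first i characters can be split into words
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true; // the empty prefix needs no words
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                // extend a splittable prefix by one more dictionary word
                if (reachable[start] && dict.contains(str.substring(start, end))) {
                    reachable[end] = true;
                    break;
                }
            }
        }
        return reachable[n];
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "Jose", "Nithin", "Joy", "Justine", "Jomin", "Thomas"));
        System.out.println(canSplit("JoyJustine", dict)); // true
        System.out.println(canSplit("JoyJust", dict));    // false
    }
}
```

For a string of length n this performs on the order of n² set lookups, which is polynomial rather than exponential. It only answers whether a split exists; listing every split, as the accepted answer does, can still produce exponentially many results.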