2

I'm trying to write a code that split a spaceless string into meaningful words but when I give sentence like "arealways" it returns ['a', 'real', 'ways'] and what I want is ['are', 'always'] and my dictionary contains all this words. How can I can write a code that keep backtracking till find the best matching?

the code that returns 'a', 'real', 'ways':

splitter.java:

public class splitter {

    HashMap<String, String> map = new HashMap<>();
    Trie dict;

    public splitter(Trie t) {
        dict = t;
    }

    public String split(String test) {
        if (dict.contains(test)) {
            return (test);
        } else if (map.containsKey(test)) {
            return (map.get(test));
        } else {
            for (int i = 0; i < test.length(); i++) {
                String pre = test.substring(0, i);
                if (dict.contains(pre)) {
                    String end = test.substring(i);
                    String fixedEnd = split(end);
                        if(fixedEnd != null){
                            map.put(test, pre + " " + fixedEnd);
                            return pre + " " + fixedEnd;
                        }else {
                        }
                    
                }
            }

        }
        map.put(test,null);
        return null;
    }
} 

Trie.java:

public class Trie {
    public static class TrieNode {
        private HashMap<Character, TrieNode> charMap = new HashMap<>();
        public char c;
        public boolean endOWord;
        public void insert(String s){
        }
        public boolean contains(String s){
            return true;
        }
    }
    public TrieNode root;
    
    public Trie() {
        root = new TrieNode();
    }
    
    public void insert(String s){
        TrieNode p = root;
        for(char c : s.toCharArray()) {
            if(! p.charMap.containsKey(c)) {
                TrieNode node = new TrieNode();
                node.c = c;
                p.charMap.put(c, node);
            }
            p = p.charMap.get(c);
        }
        p.endOWord = true;
    }
    public boolean contains(String s){
        TrieNode p = root;
        for(char c : s.toCharArray()) {
            if(!p.charMap.containsKey(c)) {
                return false;
            }
            p = p.charMap.get(c);
        }
        return p.endOWord;
    }
    public void insertDictionary(String filename) throws FileNotFoundException{
        File file = new File(filename);
        Scanner sc = new Scanner(file);
        while(sc.hasNextLine())
            insert(sc.nextLine());
    }
    

    public void insertDictionary(File file) throws FileNotFoundException{
        Scanner sc = new Scanner(file);
        while(sc.hasNextLine())
            insert(sc.nextLine());
    }
} 

WordSplitter class:

public class WordSplitter {

    public static void main(String[] args) throws FileNotFoundException {
            
           String test = "arealways";
           String myFile = "/Users/abc/Desktop/dictionary.txt";
           Trie dict = new Trie();
           dict.insertDictionary(myFile);
          splitter sp = new splitter(dict);
          test = sp.split(test);
          
          if(test != null)
          System.out.println(test);
          else
          System.out.println("No Splitting Found.");            
           
    }

        } 
hfontanez
  • 5,774
  • 2
  • 25
  • 37
abc
  • 23
  • 6
  • What do you mean by "best matching"? Do you want the solution with the fewest words? – Harry Jones Mar 03 '22 at 14:02
  • @HarryJones yes – abc Mar 03 '22 at 14:05
  • 2
    I am not sure if this is _THE_ answer, but one potential solution is to collect each permutation in a list and then return the list with the smallest size. – hfontanez Mar 03 '22 at 14:09
  • @hfontanez it can be but I can't write the code that keep finding all possible words, when it finds a, it keeps and looking for new words, how I can fix that? – abc Mar 03 '22 at 14:12
  • I don't think you can't because "a" is a valid word. I am more concerned with "area" and "ways", leaving the 'l' unmatched. Your solution, I assume _must_ consume all letters. Which begs the question, what happens when you can't match all letters? Like in something like "arellkjldsjlfjsd". – hfontanez Mar 03 '22 at 14:14
  • I think this question was already asked here. Does this answer your question? [How to split text without spaces into list of words](https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words) – hfontanez Mar 03 '22 at 14:25
  • @hfontanez if dictionary doesn't contain any of words of the string, it will return "no splitting found" but can't we write a code that keep looking for new words till can't split? I mean 'are, always' what I want to see, but how will I manage to do that? do you have any idea? – abc Mar 03 '22 at 14:25
  • @hfontanez I saw this link but its algorithm is too complicated for me, I can't manage to convert that to Java code so I tried new one. – abc Mar 03 '22 at 14:27
  • You need to try that algorithm and study it. Maybe you can tweak that to fit your needs. Also, I think you need to let your algorithm continue matching words and collect them in lists. Then, you return the smallest list. But I don't think you end there. You need to return the smallest list that consume the most letters. To figure that out, you must add the lengths of all strings in the list and compare it to the inputted string. I am a bit busy, but let me see if I can code it quickly. – hfontanez Mar 03 '22 at 14:29
  • What is `Trie`? And please define `map`. Make sure your code is reproducible. – hfontanez Mar 03 '22 at 14:31
  • You haven't provided what `t` contains when you pass it to the constructor. Or the main method for that matter. – hfontanez Mar 03 '22 at 15:02
  • Still missing the contents of `dictionary.txt` LOL – hfontanez Mar 03 '22 at 15:11
  • @hfontanez dictionary.txt contains words line by line what do you mean by that? – abc Mar 03 '22 at 15:13
  • We can't reproduce your problem unless we know what words are being inserted into the Trie object. – hfontanez Mar 03 '22 at 15:14
  • I used another implementation of `Trie` and I got the following results: `realways=real ways` and `arealways=a real ways`. I wasn't able to get the expected results either. I used https://www.baeldung.com/trie-java – hfontanez Mar 03 '22 at 15:18
  • If I remove the word "real" from the `Trie`, I get `realways=null` and `arealways=are always`. This is interesting. I bet that might be your problem. – hfontanez Mar 03 '22 at 15:19
  • You may also want to read this https://www.toptal.com/java/the-trie-a-neglected-data-structure – hfontanez Mar 03 '22 at 15:41
  • I improved my answer by adding `toString` methods to `Trie` and `TrieNode`. This helps visualize how the tree is built given the words in the dictionary. – hfontanez Mar 03 '22 at 16:21
  • @abc have you look into my answer? I think your expectation of what the result should be is incorrect. But I also think your Trie implementation is slightly incorrect. – hfontanez Mar 03 '22 at 17:56

1 Answers1

1

Using the OP's split method and the implementation of Trie found in The Trie Data Structure in Java Baeldung's article, I was able to get the following results:

realways=real ways
arealways=a real ways

However, if I remove the word "real" or "a" from the dictionary, I get the following results:

realways=null
arealways=are always

Here's the entire code I used to get these results:

public class Splitter {

    private static Map<String, String> map = new HashMap<>();
    private Trie dict;
    
    public Splitter(Trie t) {
        dict = t;
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        List<String> words = List.of("a", "always", "are", "area", "r", "way", "ways"); // The order of these words does not seem to impact the final result
        String test = "arealways";

        Trie t = new Trie();
        for (String word : words) {
            t.insert(word);
        }
        System.out.println(t);
        Splitter splitter = new Splitter(t);
        splitter.split(test);
        map.entrySet().forEach(System.out::println);
    }

    public String split(String test) {
        if (dict.find(test)) {
            return (test);
        } else if (map.containsKey(test)) {
            return (map.get(test));
        } else {
            for (int i = 0; i < test.length(); i++) {
                String pre = test.substring(0, i);
                if (dict.find(pre)) {
                    String end = test.substring(i);
                    String fixedEnd = split(end);
                    if (fixedEnd != null) {
                        map.put(test, pre + " " + fixedEnd);
                        return pre + " " + fixedEnd;
                    } else {
                    }

                }
            }

        }
        map.put(test, null);
        return null;
    }

    public static class Trie {
        private TrieNode root = new TrieNode();

        public boolean find(String word) {
            TrieNode current = root;
            for (int i = 0; i < word.length(); i++) {
                char ch = word.charAt(i);
                TrieNode node = current.getChildren().get(ch);
                if (node == null) {
                    return false;
                }
                current = node;
            }
            return current.isEndOfWord();
        }

        public void insert(String word) {
            TrieNode current = root;
            for (char l : word.toCharArray()) {
                current = current.getChildren().computeIfAbsent(l, c -> new TrieNode());
            }
            current.setEndOfWord(true);
        }
        
        @Override
        public String toString() {
            return toString(root);
        }

        /**
         * @param root2
         * @return
         */
        private String toString(TrieNode node) {
            return node.toString();
        }

        public static class TrieNode {
            private Map<Character, TrieNode> children = new HashMap<>() ;
            private String contents;
            private boolean endOfWord;

            public Map<Character, TrieNode> getChildren() {
                return children;
            }

            public void setEndOfWord(boolean endOfWord) {
                this.endOfWord = endOfWord;
            }

            public boolean isEndOfWord() {
                return endOfWord;
            }
            
            @Override
            public String toString() {
                StringBuilder sbuff = new StringBuilder();
                if (isLeaf()) {
                    return sbuff.toString();
                }
                children.entrySet().forEach(entry -> {
                    sbuff.append(entry.getKey() + "\n");
                });
                sbuff.append(" ");
                return children.toString();
            }
            
            private boolean isLeaf() {
                return children.isEmpty();
            }
        }
        
        public void delete(String word) {
            delete(root, word, 0);
        }

        private boolean delete(TrieNode current, String word, int index) {
            if (index == word.length()) {
                if (!current.isEndOfWord()) {
                    return false;
                }
                current.setEndOfWord(false);
                return current.getChildren().isEmpty();
            }
            char ch = word.charAt(index);
            TrieNode node = current.getChildren().get(ch);
            if (node == null) {
                return false;
            }
            
            boolean shouldDeleteCurrentNode = delete(node, word, index + 1) && !node.isEndOfWord();

            if (shouldDeleteCurrentNode) {
                current.getChildren().remove(ch);
                return current.getChildren().isEmpty();
            }
            return false;
        }
    }
}

I improved the original code by adding a toString() method to the Trie and TrieNode. Now, when I print out the Trie object "t", I get the following result:

{a={r={e={a=}}, l={w={a={y={s=}}}}}, w={a={y={s=}}}}

My conclusion is that the OP's TrieNode implementation is incorrect. The way the Trie is built, given the inputted string value, the behavior described by the OP seems to be correct.

hfontanez
  • 5,774
  • 2
  • 25
  • 37