3

I am working on a problem, which is to write a program to find the longest word made of other words in a list of words.

EXAMPLE
Input: test, tester, testertest, testing, testingtester
Output: testingtester

I searched and find the following solution, my question is I am confused in step 2, why we should break each word in all possible ways? Why not use each word directly as a whole? If anyone could give some insights, it will be great.

The solution below does the following:

  1. Sort the array by size, putting the longest word at the front
  2. For each word, split it in all possible ways. That is, for “test”, split it into {“t”, “est”}, {“te”, “st”} and {“tes”, “t”}.
  3. Then, for each pairing, check if the first half and the second both exist elsewhere in the array.
  4. “Short circuit” by returning the first string we find that fits condition #3.
Jongware
  • 22,200
  • 8
  • 54
  • 100
Lin Ma
  • 9,739
  • 32
  • 105
  • 175

5 Answers5

6

Answering your question indirectly, I believe the following is an efficient way to solve this problem using tries.

Build a trie from all of the words in your string.

Sort the words so that the longest word comes first.

Now, for each word W, start at the top of the trie and begin following the word down the tree one letter at a time using letters from the word you are testing.

Each time a word ends, recursively re-enter the trie from the top making a note that you have "branched". If you run out of letters at the end of the word and have branched, you've found a compound word and, because the words were sorted, this is the longest compound word.

If the letters stop matching at any point, or you run out and are not at the end of the word, just back track to wherever it was that you branched and keep plugging along.

I'm afraid I don't know Java that well, so I'm unable to provide you sample code in that language. I have, however, written out a solution in Python (using a trie implementation from this answer). Hopefully it is clear to you:

#!/usr/bin/env python3

#End of word symbol
_end = '_end_'

#Make a trie out of nested HashMap, UnorderedMap, dict structures
def MakeTrie(words):
  root = dict()
  for word in words:
    current_dict = root
    for letter in word:
      current_dict = current_dict.setdefault(letter, {})
    current_dict[_end] = _end
  return root

def LongestCompoundWord(original_trie, trie, word, level=0):
  first_letter = word[0]
  if not first_letter in trie:
    return False
  if len(word)==1 and _end in trie[first_letter]:
    return level>0
  if _end in trie[first_letter] and LongestCompoundWord(original_trie, original_trie, word[1:], level+1):
    return True
  return LongestCompoundWord(original_trie, trie[first_letter], word[1:], level)

#Words that were in your question
words = ['test','testing','tester','teste', 'testingtester', 'testingtestm', 'testtest','testingtest']

trie = MakeTrie(words)

#Sort words in order of decreasing length
words = sorted(words, key=lambda x: len(x), reverse=True)

for word in words:
  if LongestCompoundWord(trie,trie,word):
    print("Longest compound word was '{0:}'".format(word))
    break

With the above in mind, the answer to your original question becomes clearer: we do not know ahead of time which combination of prefix words will take us successfully through the tree. Therefore, we need to be prepared to check all possible combinations of prefix words.

Since the algorithm you found does not have an efficient way of knowing what subsets of a word are prefixes, it splits the word at all possible points in word to ensure that all prefixes are generated.

Community
  • 1
  • 1
Richard
  • 56,349
  • 34
  • 180
  • 251
  • 1
    Thanks Richard, I know Tries. And confused what do you mean "Each time the tree branches, recursively re-enter the trie from the top"? For an example? – Lin Ma Oct 07 '15 at 06:54
  • @LinMa, I've tried to clarify the text of my answer and have added a code sample to help clarify. – Richard Oct 07 '15 at 07:53
  • @LinMa, as I've noted at the end of my answer, if you understand how the trie algorithm I've provided works, it will give you a lot of in-sight into how the algorithm you found works. It should also convince you that the algorithm you found is rather inefficient. – Richard Oct 07 '15 at 07:56
  • I see a potential optimisation: If you have previously determined that the suffix of the current word that begins at some position i *cannot* be decomposed into a concatenation of other words, then this information can be stored and possibly reused. This will save computation if there is more than one way of decomposing some i-letter prefix of the word. E.g. if the current word is `abcdefg` and the other words are `abc`, `def`, `ab` and `cdef`, then after successfully decomposing the prefix `abcdef` into `abc|def`, we find that the remainder, `g`, can't be decomposed... – j_random_hacker Oct 07 '15 at 14:13
  • ... We then try the decomposition `ab|cdef`, which also leaves the same suffix, `g`: with your current algorithm, this will be reevaluated, even though we already know that this subproblem can't be solved. In this example, not much computation is saved, but in general there are inputs that can lead to exponential time due to many independent ways of decomposing segments, all of which are ultimately futile -- consider instead the word containing n copies of `abcdef` followed by `g`, whose initial part can be decomposed in 2^n different ways. – j_random_hacker Oct 07 '15 at 14:16
  • @j_random_hacker: I'm not entirely sure I follow, but the thought seems valuable. If you want to add an amendment to the answer demonstrating your idea with code, go for it! – Richard Oct 07 '15 at 19:51
  • @Richard, studied your solution and want to clarify basics, for one word making of other words, your understanding is there cannot be any gaps? For example, we cannot say HelloTTTWorld is made of Hello and World? And no overlap, saying Hellold is not makde of Hello and old? Thanks. – Lin Ma Oct 07 '15 at 21:43
  • 1
    That's correct, @LinMa. If you need overlaps, that would require a different algorithm. If you need gaps, that would be yet a different algorithm. – Richard Oct 07 '15 at 22:05
  • @Richard, studied your code, and get lost what does this line means "current_dict = current_dict.setdefault(letter, {})"? And it seems in your code, you never index the real character one by one in each word? Please feel free to correct me. – Lin Ma Oct 08 '15 at 00:51
  • 1
    @LinMa, it either returns the dictionary (hashmap) referred to be letter, or associates letter with a new dictionary (hashmap) and returns that. The details of that part of the algorithm are probably not too important, other than that it builds up a trie. I don't understand the second part of your comment. – Richard Oct 08 '15 at 01:18
  • Thanks Richard, sorry for any unclear comments. (1) I mean for this statement you are using, "current_dict = current_dict.setdefault(letter, {})", I do not see "returns the dictionary (hashmap) referred to be letter", I see you always create something empty into dictionary? I think you should check if the character already exists in the dictionary? (2) And the key of the dictionary is character in a word, what is the value of the dictionary? – Lin Ma Oct 08 '15 at 01:23
  • 1
    Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/91680/discussion-between-richard-and-lin-ma). – Richard Oct 08 '15 at 01:45
  • 1
    @Richard: I've added an answer that builds on yours, where I go into more depth. Let me know if it's not clear! – j_random_hacker Oct 08 '15 at 17:10
  • @Richard, thanks for all the help, mark as answered. – Lin Ma Oct 12 '15 at 19:08
4

Richard's answer will work well in many cases, but it can take exponential time: this will happen if there are many segments of the string W, each of which can be decomposed in multiple different ways. For example, suppose W is abcabcabcd, and the other words are ab, c, a and bc. Then the first 3 letters of W can be decomposed either as ab|c or as a|bc... and so can the next 3 letters, and the next 3, for 2^3 = 8 possible decompositions of the first 9 letters overall:

a|bc|a|bc|a|bc
a|bc|a|bc|ab|c
a|bc|ab|c|a|bc
a|bc|ab|c|ab|c
ab|c|a|bc|a|bc
ab|c|a|bc|ab|c
ab|c|ab|c|a|bc
ab|c|ab|c|ab|c

All of these partial decompositions necessarily fail in the end, since there is no word in the input that contains W's final letter d -- but his algorithm will explore them all before discovering this. In general, a word consisting of n copies of abc followed by a single d will take O(n*2^n) time.

We can improve this to O(n^2) worst-case time (at the cost of O(n) space) by recording extra information about the decomposability of suffixes of W as we go along -- that is, suffixes of W that we have already discovered we can or cannot match to word sequences. This type of algorithm is called dynamic programming.

The condition we need for some word W to be decomposable is exactly that W begins with some word X from the set of other words, and the suffix of W beginning at position |X|+1 is decomposable. (I'm using 1-based indices here, and I'll denote a substring of a string S beginning at position i and ending at position j by S[i..j].)

Whenever we discover that the suffix of the current word W beginning at some position i is or is not decomposable, we can record this fact and make use of it later to save time. For example, after testing the first 4 decompositions in the 8 listed earlier, we know that the suffix of W beginning at position 4 (i.e., abcabcd) is not decomposable. Then when we try the 5th decomposition, i.e., the first one starting with ab, we first ask the question: Is the rest of W, i.e. the suffix of W beginning at position 3, decomposable? We don't know yet, so we try adding c to get ab|c, and then we ask: Is the rest of W, i.e. the suffix of W beginning at position 4, decomposable? And we find that it has already been found not to be -- so we can immediately conclude that no decomposition of W beginning with ab|c is possible either, instead of having to grind through all 4 possibilities.

Assuming for the moment that the current word W is fixed, what we want to build is a function f(i) that determines whether the suffix of W beginning at position i is decomposable. Pseudo-code for this could look like:

- Build a trie the same way as Richard's solution does.
- Initialise the array KnownDecomposable[] to |W| DUNNO values.

f(i):
    - If i == |W|+1 then return 1.  (The empty suffix means we're finished.)
    - If KnownDecomposable[i] is TRUE or FALSE, then immediately return it.
    - MAIN BODY BEGINS HERE
    - Walk through Richard's trie from the root, following characters in the
      suffix W[i..|W|].  Whenever we find a trie node at some depth j that
      marks the end of a word in the set:
        - Call f(i+j) to determine whether the rest of W can be decomposed.
        - If it can (i.e. if f(i+j) == 1):
            - Set KnownDecomposable[i] = TRUE.
            - Return TRUE.
    - If we make it to this point, then we have considered all other
      words that form a prefix of W[i..|W|], and found that none of
      them yield a suffix that can be decomposed.
    - Set KnownDecomposable[i] = FALSE.
    - Return FALSE.

Calling f(1) then tells us whether W is decomposable.

By the time a call to f(i) returns, KnownDecomposable[i] has been set to a non-DUNNO value (TRUE or FALSE). The main body of the function is only run if KnownDecomposable[i] is DUNNO. Together these facts imply that the main body of the function will only run as many times as there are distinct values i that the function can be called with. There are at most |W|+1 such values, which is O(n), and outside of recursive calls, a call to f(i) takes at most O(n) time to walk through Richard's trie, so overall the time complexity is bounded by O(n^2).

Community
  • 1
  • 1
j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
  • Thanks j_random_hacker, I debugged Trie Tree code today and make hands dirty. One quick question, for example, if I want to build Trie tree for word Python, I think I should build trie/suffix tree like (Python, ython, thon, ton, on and n), so that for any input string (for example, ython), I can know if the string matches a part of the original long string (e.g. Python). In the example code, it seems it only build one logical structure Python and wondering in your code, how do you match string ython, which is a suffix of original word Python? – Lin Ma Oct 09 '15 at 03:38
  • @LinMa: I'm "borrowing" Richard's trie technique. You don't need to build a suffix tree (or suffix trie) for this -- you just "follow" the letters in the current word through the trie, exactly as he does. All I'm really adding on top of his solution is a DP technique called *memoization* that stores and reuses the solutions to subproblems instead of recomputing them. – j_random_hacker Oct 12 '15 at 11:40
2

I guess you are just making a confusion about which words are split.

After sorting, you consider the words one after the other, by decreasing length. Let us call a "candidate" a word you are trying to decompose.

If the candidate is made of other words, it certainly starts with a word, so you will compare all prefixes of the candidate to all possible words.

During the comparison step, you compare a candidate prefix to the whole words, not to split words.


By the way, the given solution will not work for triwords and longer. The fix is as follows:

  • try every prefix of the candidate and compare it to all words
  • in case of a match, repeat the search with the suffix.

Example:

testingtester gives the prefixes

t, te, tes, test, testi, testin, testing, testingt, testingte, testingtes and testingteste

Among these, test and testing are words. Then you need to try the corresponding suffixes ingtester and tester.

ingtester gives

i, in, ing, ingt, ingte, ingtes, ingtest and ingteste, none of which are words.

tester is a word and you are done.


IsComposite(InitialCandidate, Candidate):
    For all Prefixes of Candidate:
        if Prefix is in Words:
            Suffix= Candidate - Prefix
            if Suffix == "":
                return Candidate != InitialCandidate
            else:
                return IsComposite(InitialCandidate, Suffix)

For all Candidate words by decreasing size:
    if IsComposite(Candidate, Candidate):
        print Candidate
        break
  • Thanks Yves, let me confirm my understanding is correct. You mean for candidate like "testingtester", we should break it into prefix? If my understanding is correct, what are the prefix do you mean in this example? – Lin Ma Oct 07 '15 at 06:59
  • Hi Yves, not quite understand this step "in case of a match, repeat the search with the suffix", why the same prefix cannot be reused or overlap with other prefix? For overlap I mean, saying A[0:3] may match a word, and A[0:4] may match another. – Lin Ma Oct 07 '15 at 07:02
  • 1
    You need to continue the search with all matching prefixes. –  Oct 07 '15 at 07:08
  • Thanks Yves, why original solution triwords not working? An example is appreciated. :) – Lin Ma Oct 07 '15 at 07:54
  • 1
    Because the "second half" is a biword, not a word. Try `testtesttesting`. –  Oct 07 '15 at 07:56
  • Thanks Yves, for bi-word, I think you mean two concatenated words? – Lin Ma Oct 07 '15 at 21:18
  • Hi Yves, studied your solution and want to clarify basics, for one word making of other words, your understanding is there cannot be any gaps? For example, we cannot say HelloTTTWorld is made of Hello and World? And no overlap, saying Hellold is not makde of Hello and old? Thanks. – Lin Ma Oct 07 '15 at 21:44
  • BTW, Yves, not quite understand what is the logical meaning of statement, "Candidate != InitialCandidate"? Your guidance is appreciated. – Lin Ma Oct 08 '15 at 01:48
  • 1
    @LinMa: execute by hand on examples and you'll see. –  Oct 08 '15 at 06:36
  • Hi Yves, did some work today and I think "Candidate != InitialCandidate", you want to avoid the string to match itself? Thanks. – Lin Ma Oct 09 '15 at 03:40
  • Hi Yves, thanks for the confirmation. Do you think "Short circuit" posted in my original code works more efficiently than your implementation? :) – Lin Ma Oct 09 '15 at 07:43
2

I would probably use recursion here. Start with the longest word and find words it starts with. For any such word remove it from the original word and continue with the remaining part in the same manner.

Pseudo code:

function iscomposed(orininalword, wordpart)
  for word in allwords
    if word <> orininalword
      if wordpart = word
        return yes
      elseif wordpart starts with word
        if iscomposed(orininalword, wordpart - word)
          return yes
        endif
      endif
    endif
  next
  return no
end  

main
  sort allwords by length descending
  for word in allwords
    if iscomposed(word, word) return word
  next
end

Example:

words:
abcdef
abcde
abc
cde
ab

Passes:

1. abcdef starts with abcde. rest = f. 2. no word f starts with found.
1. abcdef starts with abc. rest = def. 2. no word def starts with found.
1. abcdef starts with ab. rest = cdef. 2. cdef starts with cde. rest = f. 3. no word f starts with found.
1. abcde starts with abc. rest = cde. 2. cde itself found. abcde is a composed word
Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
  • Thanks Thorsten for the details, studied your solution and want to clarify basics, for one word making of other words, your understanding is there cannot be any gaps? For example, we cannot say HelloTTTWorld is made of Hello and World? And no overlap, saying Hellold is not makde of Hello and old? Thanks. – Lin Ma Oct 07 '15 at 21:43
  • Hi Thorsten, not quite catch what is the logical meaning of statement "wordpart <> orininalword and wordpart = word"? your guidance is appreciated. Thanks. – Lin Ma Oct 08 '15 at 01:51
  • 1
    No, there cannot be any gaps. 'HelloTTTWorld' consists of three parts, 'Hello', 'TTT', and 'World'. Only if 'TTT' is a word in your list, I would consider the whole as a composed word. Otherwise you would interprete 'function' as 'fun' and 'on' with a "gap". And no, there cannot be overlaps. I think it is obvious that 'done' isn't composed of 'do' and 'one'. By the way: Be aware that all this is very technical. You will still identify words as composed that are not: do-main, torn-a-do, ... – Thorsten Kettner Oct 08 '15 at 05:41
  • 1
    `word = wordpart` should be clear. Once we found 'fingerprint' to start with 'finger', we will check the remaining wordpart 'print'. There is a word 'print' in allwords, so we find a `word = wordpart` and know that 'fingerprint' is a composed word. With every call of iscomposed we name the original word 'fingerprint' too, however, in order to check whether we are dealing with the whole word or just part thereof. Otherwise every word would be identified as composed, because it would always match itself. (The `<>` is supposed to mean "not equal"). – Thorsten Kettner Oct 08 '15 at 05:51
  • 1
    So: We start with 'fingerprint'. We find 'fingerprint' in allwords, but know we can ignore this (as it is the original word), then we find 'finger' (which is not the original word) and work on this. (However, I just notice, I forgot to do the same in the `elseif`, so I'll correct this now.) – Thorsten Kettner Oct 08 '15 at 05:52
  • Thanks Thorsten, want to confirm my understanding is correct, in your method iscomposed(orininalword, wordpart), orininalword means the whole word candidate and wordpart means the part which has not been check whether matched? Thanks. – Lin Ma Oct 08 '15 at 05:58
  • 1
    The orininalword is only given to be excluded from the list when searching for matches. We could as well use a flag, or for better performance even the position in allwords, so as to start searching after that position only. First call: iscomposed('fingerprint','fingerprint'). We must not return yes when we find the word"part" in the list, as 'fingerprint' is not actually a part of 'fingerprint'. Second call: iscomposed('fingerprint','print'). Here we must return yes, when we find 'print' in the list, as 'print' is a real wordpart of 'fingerprint'. – Thorsten Kettner Oct 08 '15 at 06:04
  • Thanks Thorsten, for your comments, "start searching after that position only", how to find the position to search from? A composed word could be composed by shorter word, then longer word, and could be vice versa. :) – Lin Ma Oct 08 '15 at 06:06
  • BTW, Thorsten, do you understand the original "Short circuit" solution? It sounds a more efficient implementation without recursive? Thanks. – Lin Ma Oct 08 '15 at 06:07
  • 1
    The array allwords is sorted by length. If we want to examine 'fingerprint', we can give the funtion its position in the array. So the function would no longer check whether fingerprint starts with itself or even any of the longer words. We would have a loop starting from, say, 'fingernail' instead. (Well, it was a spontaneous thought. Just an alternative. It may be much better to have the array sorted by name and do a binary search then. But that's details already I haven't given much thought.) – Thorsten Kettner Oct 08 '15 at 16:52
  • 1
    As to the "short circuit" solution: Yes, I think I understand it. You split 'fingerprint' in all possible parts and exclude all except {'finger','print'}, because this is the only combination where both the first part and the last part are words. However, I think your example is over-simplifying things. A word can consist of two or more words. So you would have to split 'fingerprint' in pairs of two, three, four, five ... That must be hundreds of combinations. And for each combination you would have to look up words in your list. Hundreds of searches in order to make the "short circuit". – Thorsten Kettner Oct 08 '15 at 16:55
  • thanks a lot and wondering what means "short circuit" in your mind? Any sub string in the original string which is a legal word? – Lin Ma Oct 09 '15 at 00:31
  • Hi Thorsten, I debugged Trie Tree code today and make hands dirty. One quick question, for example, if I want to build Trie tree for word Python, I think I should build trie/suffix tree like (Python, ython, thon, ton, on and n), so that for any input string (for example, ython), I can know if the string matches a part of the original long string (e.g. Python). In the example code, it seems it only build one logical structure Python and wondering in your code, how do you match string ython, which is a suffix of original word Python? – Lin Ma Oct 09 '15 at 03:25
  • 1
    Hi, I've never used Trie trees, so I cannot help you in this. I hadn't even thought about the actual string search so far, but only for the main algorithm. Thinking about it now, I guess I would use binary search on a list sorted alphabetically for my algorithm, and maybe a hash technique for yours where you are only looking for complete words. (I.e. for 'test' you would look up 't' and 'te' and 'tes' (and find none of These being words), whereas I would look up all words that 'test' starts with (which is only 'test' itself, so no find either). – Thorsten Kettner Oct 09 '15 at 06:27
  • 1
    A short circuit in an algorithm can mean two things: 1. Stop at the first solution rather than looking for more. 2. Cut a path early rather than following it needlessly. In your algorithm the short circuit seems to be of the first kind. And this is not a short circuit at all in my opinion, because the task is to find out whether a word consists of others, not of which. And how often do you actually find two or more solutions for a word (e.g. bear-skin and bears-kin)? – Thorsten Kettner Oct 09 '15 at 07:07
  • Thanks Thorsten for all the inputs, I do not quite catch what do you mean " And how often do you actually find two or more solutions for a word"? How it related to short circuit? More details is appreciated. :) – Lin Ma Oct 09 '15 at 07:57
  • 1
    I gave an example. You want to know whether 'bearskin' is a composed word. You find 'bear' + 'skin' and stop. You could also go on and find 'bears' + 'kin', but what for? The question whether 'bearskin' is a composed word or not is already answered with the first find. So yes, you don't evaluate all combinations, and you *may* call this a short circuit, but then, why would you even *want* to evaluate all combinations instead of just stopping on the first find? It is only natural to stop, so I would call it innate to the algorithm, rather than call it a short circuit. – Thorsten Kettner Oct 09 '15 at 08:38
  • 1
    As to "And how often do you actually find two or more solutions for a word?": 'bearskin' which could be composed of 'bear' + 'skin' or of 'bears' + 'kin' is a very rare case. How many other words can you come up with that can be composed of more then just one combination? – Thorsten Kettner Oct 09 '15 at 08:42
  • Thanks Thorsten, for your comments, "How many other words can you come up with that can be composed of more then just one combination", are you proposing we should evaluate a new word, other than trying multiple combinations as an optimization tactic? – Lin Ma Oct 09 '15 at 21:39
  • Hi Thorsten, I do not quite catch this comments -- "So yes, you don't evaluate all combinations, and you may call this a short circuit, but then, why would you even want to evaluate all combinations instead of just stopping on the first find", I think the problem is just to prove the word could be composed by other words by one composition method, other than finding all combinations if a word could be composed by other smaller words in multiple ways. Confused why you say "evaluate all combinations"? If my understanding is wrong, please feel free to correct me. :) – Lin Ma Oct 09 '15 at 21:42
0

To find longest world using recursion

class FindLongestWord {

public static void main(String[] args) {
    List<String> input = new ArrayList<>(
            Arrays.asList("cat", "banana", "rat", "dog", "nana", "walk", "walker", "dogcatwalker"));

    List<String> sortedList = input.stream().sorted(Comparator.comparing(String::length).reversed())
            .collect(Collectors.toList());

    boolean isWordFound = false;
    for (String word : sortedList) {
        input.remove(word);
        if (findPrefix(input, word)) {
            System.out.println("Longest word is : " + word);
            isWordFound = true;
            break;
        }
    }
    if (!isWordFound)
        System.out.println("Longest word not found");

}

public static boolean findPrefix(List<String> input, String word) {

    boolean output = false;
    if (word.isEmpty())
        return true;
    else {
        for (int i = 0; i < input.size(); i++) {
            if (word.startsWith(input.get(i))) {
                output = findPrefix(input, word.replace(input.get(i), ""));
                if (output)
                    return true;
            }
        }
    }
    return output;

}

}