How to find if a string contains a word from a dictionary?

Question

I need to find out whether a string that was made by removing the space between two words contains a word from a dictionary.

I already have the dictionary stored in a BST.

As input I get a text file with random spaces removed. For example:

We left in pretty good time, and came after nightfallto Klausenburgh. Here I stopped for the night at the Hotel Royale. I had for dinner, or rather supper, a chicken done up some way with red pepper, which was very goodbut thirsty. (Mem., get recipe for Mina.) I asked the waiter, and he said it was called "paprika hendl," and that, as it was a nationaldish, I should be able to get it anywhere along the Carpathians. I found my smattering of German very useful here; indeed, I don't know how I should be able to get on without it.

I read the file and saved every word in a list. I need to check whether each word is in the dictionary and count its frequency; I have already done this part. The hard part is checking whether I can recover dictionary words from a string that had a space removed.

For example, 'goodbut' should give me 'good', which should be added to the frequency counter; 'but' should not, since it is not in my dictionary.

I have a list of all the strings from the text file that were not in the dictionary when I looked up the frequencies. I need to go through those strings to see if I can find a legal word in them.

But I don't know how, nor where to start.

_But i don't know how. nor where to start_ Start by posting your code and indicate which parts you need help with. – Abra, Nov 20 '19 at 19:52
Welcome to Stack Overflow! In order for us to help you, it would be very beneficial if you could provide an example of the output you expect. Also try to provide a more readable input if possible. Finally, have you tried anything that we can help you with? – Antonio López Ruiz, Nov 20 '19 at 19:52
This is very similar to a question I asked years ago: https://stackoverflow.com/questions/5922956/java-dictionary-searcher You need to split the strings that are not in the found set of strings (or be smarter and apply one of the subsequent approaches). – Brendan Lesniak, Nov 20 '19 at 19:54
What about compound words, like "sidewalk"? These are words in their own right, but composed of smaller words. – erickson, Nov 20 '19 at 23:17

juancn · Accepted Answer · 2019-11-20T20:51:50.190

For each word in the text:

Iterable<String> words = ...;
for (String word : words) {
    processSubWords(word);
}

For each word, generate every way of splitting it into a left and a right part, which corresponds to re-inserting a single space (this is only possible for words with 2 or more characters):

void processSubWords(String word) {
    // A split needs at least one character on each side.
    if (word.length() > 1) {
        // i is the position where the missing space is re-inserted.
        for (int i = 1; i < word.length(); i++) {
            final String left = word.substring(0, i);
            final String right = word.substring(i);
            lookupAndUpdate(left);
            lookupAndUpdate(right);
        }
    }
}

Then in lookupAndUpdate you would do a dictionary lookup and update the frequency counter if there's a match.

As an example, if you passed goodbut to processSubWords, it would call lookupAndUpdate with the following strings:

g
oodbut
go
odbut
goo
dbut
good
but
goodb
ut
goodbu
t

Of those, only good should (likely) match your dictionary.
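
Putting the pieces above together, here is a minimal runnable sketch. The class name SplitWordCounter, the TreeSet standing in for the asker's BST, the HashMap frequency counter, and the tiny four-word dictionary are all assumptions for illustration; only processSubWords and lookupAndUpdate come from the answer.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SplitWordCounter {
    // Hypothetical stand-in for the asker's BST-backed dictionary.
    private final Set<String> dictionary;
    // Hypothetical frequency counter: dictionary word -> number of hits.
    private final Map<String, Integer> frequencies = new HashMap<>();

    SplitWordCounter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Try every position where a single space could be re-inserted.
    void processSubWords(String word) {
        for (int i = 1; i < word.length(); i++) {
            lookupAndUpdate(word.substring(0, i)); // left part
            lookupAndUpdate(word.substring(i));    // right part
        }
    }

    // Count the candidate only if it is a dictionary word.
    void lookupAndUpdate(String candidate) {
        if (dictionary.contains(candidate)) {
            frequencies.merge(candidate, 1, Integer::sum);
        }
    }

    Map<String, Integer> frequencies() {
        return frequencies;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new TreeSet<>(List.of("good", "night", "fall", "dish"));
        SplitWordCounter counter = new SplitWordCounter(dictionary);
        for (String word : List.of("goodbut", "nightfallto", "nationaldish")) {
            counter.processSubWords(word);
        }
        System.out.println(counter.frequencies());
    }
}
```

With this dictionary, good, night and dish are each counted once. fall is not counted for nightfallto, because it is neither a prefix nor a suffix of that string, which is in line with the single-missing-space assumption. Note that one string can add more than one count if several of its prefixes or suffixes are dictionary words.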

What about "oo", "dbu", "odbu", etc.? Aren't these "sub-words"? – LowKeyEnergy, Nov 20 '19 at 21:05
Not as defined. From the description of the problem only a single space was erased. So you have to re-insert it. There's no single space insertion that turns 'goodbut' into 'oo' and another substring. You would need to insert two spaces. – juancn, Nov 29 '19 at 18:49

score -1 · Answer 2 · answered Nov 20 '19 at 19:57

I think a regex matcher with a counter should produce the desired result. Example code would look something like this:

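import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts how many times key occurs anywhere in source, including inside longer words.
public int countWords(String key, String source) {
    // Pattern.quote treats key as literal text, not as regex syntax.
    Pattern pattern = Pattern.compile(Pattern.quote(key));
    Matcher matcher = pattern.matcher(source);

    int count = 0;
    while (matcher.find()) {
        count++;
    }
    return count;
}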

Here key is the word "good" from your example and source is the text. For this input the method returned a count of 2.
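
To make it easier to see what this kind of substring match counts, here is a small self-contained sketch. The class name RegexCountDemo, the shortened excerpt, and the use of Pattern.quote are my assumptions and not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCountDemo {
    // Counts every occurrence of key in source, whether or not it is a whole word.
    static int countWords(String key, String source) {
        Matcher matcher = Pattern.compile(Pattern.quote(key)).matcher(source);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Shortened excerpt of the question's sample text (assumption for brevity).
        String excerpt = "pretty good time, which was very goodbut thirsty";
        System.out.println(countWords("good", excerpt));    // prints 2
        System.out.println(countWords("good", "goodness")); // prints 1
    }
}
```

The second call shows that the match also fires inside an ordinary longer word, so every dictionary word that happens to be a substring of another word would be counted as well.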

answered Nov 20 '19 at 19:57

Svilen Yanovski

His example doesn't contain the word "good". It contains "goodbut". – LowKeyEnergy Nov 20 '19 at 20:14
The matcher gets "good" (5th word in the text) and "goodbut" in bold (space removed string). No other matches, and the frequency counter is 2. – Svilen Yanovski Nov 20 '19 at 20:41
I don't think you understood his question. How can he determine if "goodbut" is composed of dictionary words? – LowKeyEnergy Nov 20 '19 at 20:45
He has different words in the dictionary (he said he already extracted them). We assume (as the guy is not here anymore) that the word "good" is in the dictionary and that he needs to find the word in the text even when a space has been removed: "good but" -> "goodbut". Here we use a regex to match these cases. The regex matches both "good" (an exact match of the keyword) and "goodbut" (the keyword with the space before the next word removed). What is left is to loop through the dictionary and call this method. – Svilen Yanovski Nov 20 '19 at 20:57
This doesn't solve the problem. – LowKeyEnergy Nov 20 '19 at 21:05
