2

I'm writing an application and I'm faced with the task to find possible words in a dictionary based on an input string and a description of what to search for. The dictionary is a text file (one word per row) and contains around 220,000 words.

An input string can consist of four things:

  • Normal characters A-Z
  • Joker *. This can be any character A-Z
  • Vowel @. The character must be a vowel
  • Consonant #. The character must be a consonant

For example, the input string *AT@# should return words like "rated", "satin", "later" etc. but not the word "ratio" because it doesn't end with a consonant.

A description is used to tell how the input string should appear in the word. It can be:

  • Words that begin with. *AT@# as input returns words like "material".
  • Words that end with. *AT@# as input returns words like "refrigerator".
  • Words that contain. *AT@# as input returns words like "catered"
  • Words that fit. *AT@# as input returns words like "hater".

The first thing to figure out is the best data structure for the dictionary. Since I have the descriptions to think about, I'm not sure a tree structure is the best way to go. It seems to be good for prefix searching and I can probably create another tree for reversed words to handle suffix searching. I'm not sure about words that contain a sequence of characters though. A tree doesn't feel right. On the other hand I can't think of anything else. Which data structures shall I use for each of my descriptions?

I'm also thinking about creating a regular expression based on the input string and the description and then match it against every string in the dictionary. However, I haven't used regular expression before so I don't know how expensive this is.

Thanks in advance!

  • 2
    Searching 220,000 words the "dumb" way, as long they're already in memory (so you're not loading the file on every search), will probably take less than 0.1 seconds. – user253751 Feb 10 '15 at 23:21
  • This may be a slight duplicate of what you're asking, but... [Best data structure for dictionary implementation](http://stackoverflow.com/questions/10017808/best-data-structure-for-dictionary-implementation) – Ascalonian Feb 10 '15 at 23:25
  • @immibis Thanks for the input. My dictionary isn't a good one though. My wish is to get my hands on one with close to a million words. I'm hoping to learn something new and not just stick to an ugly solution! – Richard Silvertass Feb 10 '15 at 23:44
  • @Ascalonian Had a look at that question before I posted mine. As mentioned, tree structures seem to work excellent for prefix and suffix but I'm not sure about my other two descriptions. – Richard Silvertass Feb 10 '15 at 23:47

1 Answers1

0

In one of my classes we used a trie data structure to store a dictionary. Each node of the trie has a string that is just its letter and it has children representing any letter that could follow it based on the words in the dictionary. For example if the letter of the first trie node was 'a' and apple, abraham and acorn were in the dictionary, the node would have child nodes of 'p', 'b' and 'c'. Each node also has a boolean that denotes whether or not it is the final letter of any word the dictionary contains. You then check the words presence in the dictionary by comparing the first and then subsequent letters in your input word with the available child nodes. The advantage is that the worst possible performance you can have is 26 times the number of letters in the word you are searching.

richtarj
  • 61
  • 4