I'm writing an application and I'm faced with the task to find possible words in a dictionary based on an input string and a description of what to search for. The dictionary is a text file (one word per row) and contains around 220,000 words.
An input string can consist of four things:
- Normal characters A-Z
- Joker *. This can be any character A-Z
- Vowel @. The character must be a vowel
- Consonant #. The character must be a consonant
For example, the input string *AT@# should return words like "rated", "satin", "later" etc. but not the word "ratio" because it doesn't end with a consonant.
A description is used to tell how the input string should appear in the word. It can be:
- Words that begin with. *AT@# as input returns words like "material".
- Words that end with. *AT@# as input returns words like "refrigerator".
- Words that contain. *AT@# as input returns words like "catered"
- Words that fit. *AT@# as input returns words like "hater".
The first thing to figure out is the best data structure for the dictionary. Since I have the descriptions to think about, I'm not sure a tree structure is the best way to go. It seems to be good for prefix searching and I can probably create another tree for reversed words to handle suffix searching. I'm not sure about words that contain a sequence of characters though. A tree doesn't feel right. On the other hand I can't think of anything else. Which data structures shall I use for each of my descriptions?
I'm also thinking about creating a regular expression based on the input string and the description and then match it against every string in the dictionary. However, I haven't used regular expression before so I don't know how expensive this is.
Thanks in advance!