I have fairly large dictionary (200K words, 2-16 chars length) and various input strings (5-200 words separated by spaces, 2-20 chars length). Using PHP in cli mode I need to compare each input word to words in dictionary and calculate minimum levenshtein distance with almost maximum efficiency - how can I do it?
What I have already tried:
- Implemented my own basic comparison algorithm (with exponential complexity) - very slow.
- Implemented my own advanced comparison algorithm (based on words length) - faster, but still very slow.
- Converted dictionary into Trie data structure and implemented search in a Trie - faster than p.2, but still not enough.
- Added extra hashtable for cases when input word match dict word (zero distance). Also input words with calculated distance are put into hashtable so they are not calulated twice when repeated in input string. STILL NOT FAST ENOUGH.
What I'm thinking of:
- Precalulating Trie nodes for dictionary as the latter never changes.
- Using arrays instead of objects for Trie nodes.
- Implementing another algorithm? Like n-grams or levenshtein automata? Not sure whether it is worth it.