1

Provided a list of valid words, and a search word, I want to find whether the search word is a valid word or not ALLOWING 2 typo characters.

What would be a good data structure to store a dictionary of words(assumingly it contains a million words) and algorithm to find whether the word exists in the dictionary(allowing 2 typo characters).

If no typo characters where allowed, then a Trie would be a good way to store the words but not sure if it stays the best way to store dictionary when typos are allowed. Not sure what the complexity for a backtracking algorithm(to search for a word in Trie allowing 2 typos) would be. Any idea about it?

  • Related (and containing answers pertinent to this question): https://stackoverflow.com/questions/10017808/best-data-structure-for-implementing-a-dictionary?rq=1 – Adrian McCarthy Feb 01 '19 at 21:00
  • Also related is my answer on this one, which can quickly find good candidate words without focusing on a specific edit distance: https://stackoverflow.com/questions/2294915/what-algorithm-gives-suggestions-in-a-spell-checker – Adrian McCarthy Feb 01 '19 at 21:04

2 Answers2

0

You might want to checkout Directed Acyclic Word Graph or DAWG. It has more of an automata structure than a tree of graph structure. Multiple possibilities from one place may provide you with your solution.

Samarth Juneja
  • 766
  • 1
  • 8
  • 21
0

If there is no need to also store all mistyped words I would consider to use a two-step approach for this problem.

  1. Build a set containing hashes of all valid words (not including typos). So probably we are talking here about some 10.000 entries, which should still allow quite fast lookups with a binary search. If the hash of a word is found in the set it is typed correctly.

  2. If a words hash is not found in the set the word is probably mistyped. So calculate a the Damerau-Levenshtein distance between the word and all known words to figure out what the user might have meant. To gain some performance here modify the DL-algorithm to abort calculation if the distance gets bigger than your allowed threshold of 2 typos.

elite gamer88
  • 73
  • 1
  • 1
  • 13
J. Mueller
  • 121
  • 1
  • 7