I'm trying to do typo correction with spaCy, and for that I need to know whether a word exists in the vocab or not. If it doesn't, the idea is to split the word in two until all segments do exist. For example, "ofthe" does not exist, while "of" and "the" do. So I first need to know whether a word exists in the vocab. That's where the problems start. I try:
for token in nlp("apple"):
print(token.lemma_, token.lemma, token.is_oov, "apple" in nlp.vocab)
apple 8566208034543834098 True True
for token in nlp("andshy"):
print(token.lemma_, token.lemma, token.is_oov, "andshy" in nlp.vocab)
andshy 4682930577439079723 True True
It's clear that this makes no sense: in both cases "is_oov" is True, and yet the vocab check says the word is in the vocabulary. I'm looking for something simple like:
"andshy" in nlp.vocab = False, "andshy".is_oov = True
"apple" in nlp.vocab = True, "apple".is_oov = False
And in the next step, I also need some word-correction method. I could use the spellchecker library, but that isn't consistent with the spaCy vocab.
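For reference, this is roughly how I would use it; I mean the pyspellchecker package (imported as spellchecker), and the outputs in the comments are what I expect from its documented behaviour. Its word list is its own frequency dictionary, not spaCy's vocab, which is the inconsistency I mean:

from spellchecker import SpellChecker  # pip install pyspellchecker

spell = SpellChecker()

# membership tests against pyspellchecker's own dictionary
print(spell.known(["apple", "andshy"]))    # {'apple'}
print(spell.unknown(["apple", "andshy"]))  # {'andshy'}

# most likely correction for a misspelling
print(spell.correction("appel"))           # 'apple'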
This seems like a common problem, and any suggestions (with code) are most welcome.
Thanks,
AHe