6

I'm trying to do typo correction with spaCy, and for that I need to know whether a word exists in the vocab or not. If not, the idea is to split the word in two until all segments do exist. For example, "ofthe" does not exist, while "of" and "the" do. So I first need to know if a word exists in the vocab, and that's where the problems start. I try:

import spacy

nlp = spacy.load("en_core_web_sm")

for token in nlp("apple"):
    print(token.lemma_, token.lemma, token.is_oov, "apple" in nlp.vocab)
# apple 8566208034543834098 True True

for token in nlp("andshy"):
    print(token.lemma_, token.lemma, token.is_oov, "andshy" in nlp.vocab)
# andshy 4682930577439079723 True True

It's clear that this makes no sense: in both cases is_oov is True, and yet the word is in the vocabulary. I'm looking for something simple like

"andshy" in nlp.vocab = False, "andshy".is_oov = True
"apple" in nlp.vocab = True, "apple".is_oov = False

And in the next step, I also need some word correction method. I can use the spellchecker library, but that's not consistent with the spaCy vocab.

This appears to be a frequently asked question, and any suggestions (code) are most welcome.

thanks,

AHe

user9165100
  • There doesn't seem to be a question here. – erip Dec 29 '19 at 22:49
  • Question is: "how do you do this"? Generalizing the question makes a lot more sense than writing loads of code that does not work (imho). – user9165100 Dec 30 '19 at 10:55
  • I still don't know what "this" is, though. What is your question? For tips on how to ask, please refer to [this page](https://stackoverflow.com/help/how-to-ask). – erip Dec 30 '19 at 13:41
  • The question was: "how do I find a word in the spaCy vocabulary?" Sorry for creating the confusion and ambiguity. – user9165100 Dec 30 '19 at 19:40

2 Answers

6

Short answer: spaCy's models do not contain any word lists that are suitable for spelling correction.

Longer answer:

spaCy's vocab is not a fixed list of words in a particular language. It is just a cache with lexical information about tokens that have been seen during training and processing. Checking whether a token is in nlp.vocab just checks whether it is in this cache, so it is not a useful check for spelling correction.
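You can see the cache behavior directly (a minimal sketch, assuming en_core_web_sm, but any model behaves the same way):

import spacy

nlp = spacy.load("en_core_web_sm")

# a made-up token is not in the cache on a freshly loaded model
print("andshy" in nlp.vocab)  # False

# processing a text adds its tokens to the cache as a side effect
nlp("andshy")
print("andshy" in nlp.vocab)  # True, only because we just processed it

This is why the check in the question returned True: by the time "andshy" in nlp.vocab runs, nlp("andshy") has already cached the token.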

Token.is_oov has a more specific meaning that's not obvious from its short description in the docs: it reports whether the model contains some additional lexical information about this token, like Token.prob. For a small spaCy model like en_core_web_sm, which doesn't contain any probabilities, is_oov will be True for all tokens by default. The md and lg models contain lexical information about 1M+ tokens and word vectors for 600K+ tokens, but these lists are too large and noisy to be useful for spelling correction.
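To illustrate (a sketch, assuming en_core_web_md is installed; exact results depend on the model version):

import spacy

nlp = spacy.load("en_core_web_md")

# the md model ships lexical information for frequent words,
# so is_oov actually distinguishes real words from non-words
print(nlp.vocab["apple"].is_oov)   # False
print(nlp.vocab["andshy"].is_oov)  # True

But as noted above, the underlying lists contain plenty of noisy entries (typos, rare strings), so is_oov alone is still not a reliable spelling check.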

aab
  • Thanks. I'll use the spellchecker (https://pypi.org/project/pyspellchecker/) with some personal hacks. For example, an extensive list of typos that occur in your corpus ... – user9165100 Jan 08 '20 at 20:58
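Basic pyspellchecker usage, including loading your own word list as the comment above suggests, looks roughly like this (a sketch; the added words are hypothetical examples):

from spellchecker import SpellChecker

spell = SpellChecker()

# find words the checker does not know
print(spell.unknown(["apple", "andshy"]))  # {'andshy'}

# suggest corrections for a misspelled word
print(spell.correction("andshy"))
print(spell.candidates("andshy"))

# add domain-specific words so they stop being flagged
spell.word_frequency.load_words(["spaCy", "lemma"])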
0

For spellchecking, you can try spacy_hunspell. You can add this to the pipeline.

More info and sample code are here: https://spacy.io/universe/project/spacy_hunspell
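The sample on that page boils down to something like this (the 'mac' argument selects bundled dictionaries; on other systems it may need to be 'linux' or a path to .dic/.aff files):

import spacy
from spacy_hunspell import spaCyHunSpell

nlp = spacy.load("en_core_web_sm")

# add the hunspell component to the pipeline
hunspell = spaCyHunSpell(nlp, "mac")
nlp.add_pipe(hunspell)

doc = nlp("I can haz cheezeburger.")
haz = doc[2]
print(haz._.hunspell_spell)    # False: not a known word
print(haz._.hunspell_suggest)  # list of suggested corrections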

piernik