
I am using en_core_web_lg to compare some texts for similarity and I am not getting the expected results.

I suspect the issue is that my texts are mostly religious, for example: "Thus hath it been decreed by Him Who is the Source of Divine inspiration." "He, verily, is the Expounder, the Wise." "Whoso layeth claim to a Revelation direct from God, ere the expiration of a full thousand years, such a man is assuredly a lying impostor."

My question is: is there a way to check spaCy's "dictionary"? Does it include words like "whoso", "layeth", "decreed", or "verily"?

Chicago1988
See this answer: [How to get all words from Spacy vocab](https://stackoverflow.com/questions/54495502/how-to-get-all-words-from-spacy-vocab?rq=1). – Oliver Mason Jul 29 '21 at 13:01

1 Answer


To check whether spaCy knows an individual word, look at `tok.is_oov` ("is out of vocabulary"), where `tok` is a token from a doc.
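As a rough sketch of that check, the helper below collects the distinct alphabetic tokens that a pipeline flags as out of vocabulary. The name `oov_words` is made up for this example and is not part of spaCy. In recent spaCy versions `is_oov` reports whether the token lacks a word vector in the loaded model, so the result depends on which model you pass in.

```python
def oov_words(nlp, text):
    """Return the distinct alphabetic tokens in `text` flagged as out of vocabulary.

    `nlp` is expected to be a loaded spaCy pipeline (any callable that
    returns tokens with `text`, `is_alpha` and `is_oov` attributes works).
    """
    doc = nlp(text)
    # Skip punctuation and numbers; keep only word-like tokens marked OOV.
    flagged = (tok.text for tok in doc if tok.is_alpha and tok.is_oov)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(flagged))


# Usage (assumes spaCy and the en_core_web_lg model are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   print(oov_words(nlp, "Whoso layeth claim to a Revelation direct from God"))
```

Running this over a sample of your corpus should give a quick picture of how many archaic forms the model has no vector for.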

spaCy's English pipelines are trained on a dataset called OntoNotes. While that does include some older texts, like the Bible, it consists mostly of relatively recent newspapers and similar sources. The word vectors are trained on Internet text. I would not expect it to work well with documents like the ones you describe, which are very different from what it has seen before.

I would suggest you train custom word vectors on your dataset, which you can then load into spaCy. You could also look at the HistWords project.
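One possible way to load custom vectors is sketched below. It assumes the vectors were trained with an external tool and exported in the plain-text word2vec format (a header line with count and dimension, then one word per line followed by its values); the answer does not name a tool or a format, so both are assumptions. The helper names `read_word2vec_text` and `add_vectors` are invented for this example; `Vocab.set_vector` is the spaCy call that does the actual work.

```python
def read_word2vec_text(lines):
    """Parse word2vec text format into a {word: [float, ...]} dict."""
    rows = iter(lines)
    # Header line: "<number of words> <vector dimension>"
    _count, dim = (int(x) for x in next(rows).split())
    vectors = {}
    for line in rows:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def add_vectors(nlp, vectors):
    """Copy the parsed vectors into the vocabulary of a spaCy pipeline."""
    # Imported lazily so the parser above can be used without NumPy.
    import numpy

    for word, values in vectors.items():
        nlp.vocab.set_vector(word, numpy.asarray(values, dtype="float32"))


# Usage (assumes spaCy is installed and "my_vectors.txt" is your own file):
#   import spacy
#   nlp = spacy.blank("en")  # start without pretrained vectors
#   with open("my_vectors.txt", encoding="utf8") as f:
#       add_vectors(nlp, read_word2vec_text(f))
```

Starting from a blank pipeline avoids mixing your vectors with pretrained ones of a different dimension. spaCy v3 also provides a `spacy init vectors` command that converts a vectors file into a loadable pipeline, which is probably the better route for a large vocabulary.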

polm23