
I'm trying to do data augmentation on a FAQ dataset. I replace words, specifically nouns, with their most similar WordNet synonyms, checking the similarity with spaCy. I use several nested for loops to go through my dataset.

import spacy
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

nlp = spacy.load('en_core_web_md')
nltk.download('wordnet')
questions = pd.read_csv("FAQ.csv")

list_questions = []
for question in questions.values:
    list_questions.append(nlp(question[0]))

for question in list_questions: 
    for token in question:
        threshold = 0.5
        if token.pos_ == 'NOUN':
            wordnet_syn = wn.synsets(str(token), pos=wn.NOUN)
            for syn in wordnet_syn:
                for lemma in syn.lemmas():
                    similar_word = nlp(lemma.name())
                    if similar_word.similarity(token) != 1. and similar_word.similarity(token) > threshold:
                        good_word = similar_word
                        threshold = token.similarity(similar_word)

However, the following warning is printed several times and I don't understand why:

UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.

The similar_word.similarity(token) call is what triggers the warning, but I don't understand why. My list_questions looks like this:

list_questions = [Do you have a paper or other written explanation to introduce your model's details?, Where is the BERT code come from?, How large is a sentence vector?]

I need to check token as well as similar_word in the loop. For example, I still get the warning here:

tokens = nlp(u'dog cat unknownword')
similar_word = nlp(u'rabbit')

if(similar_word):
    for token in tokens:
        if (token):
            print(token.text, similar_word.similarity(token))
Jonor

2 Answers


You get that warning when similar_word is not a valid spaCy document. For example, this is a minimal reproducible example:

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat')
#similar_word = nlp(u'rabbit')
similar_word = nlp(u'')

for token in tokens:
  print(token.text, similar_word.similarity(token))

If you change the '' to be 'rabbit' it works fine. (Cats are apparently just a fraction more similar to rabbits than dogs are!)

(UPDATE: As you point out, unknown words also trigger the warning; they will be valid spacy objects, but not have any word vector.)

So, one fix would be to check similar_word is valid, including having a valid word vector, before calling similarity():

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat')
similar_word = nlp(u'')

if(similar_word and similar_word.vector_norm):
  for token in tokens:
    if(token and token.vector_norm):
      print(token.text, similar_word.similarity(token))

Alternative Approach:

You could suppress this particular warning, W008. I believe setting the environment variable SPACY_WARNING_IGNORE=W008 before running your script would do it. (Not tested.)

(See source code)


By the way, similarity() might cause some CPU load, so it is worth storing the result in a variable instead of calculating it three times as you currently do. (Some people might argue that is premature optimization, but I think it might also make the code more readable.)
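
As a rough sketch of both points (guarding on vector_norm and computing the similarity only once), the inner part of the loop could be pulled into a helper like the one below. The name best_synonym and its signature are made up for illustration; the only assumption about the arguments is that they expose vector_norm and similarity(), as spaCy Doc and Token objects do.

```python
def best_synonym(token, candidates, threshold=0.5):
    """Return (best_candidate, score) for the candidate most similar to `token`.

    `best_synonym` is a hypothetical helper, not part of spaCy. `token` and
    each candidate only need `vector_norm` and `similarity()`.
    """
    best, best_score = None, threshold
    for candidate in candidates:
        # A zero norm means there is no usable vector; calling similarity()
        # on such an object is what triggers W008, so skip it.
        if not token.vector_norm or not candidate.vector_norm:
            continue
        score = candidate.similarity(token)  # computed once, reused below
        if score != 1.0 and score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

In the question's loop, candidates would be the nlp(lemma.name()) documents built from the WordNet lemmas, and a returned best of None means nothing beat the threshold.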

Darren Cook
  • Thanks for your answer, but I still get an error. I need to check also `token`. I edited my post with your example to show you the error. – Jonor May 02 '19 at 09:44
  • @Jonor You're right (I had tried that, but it seems the nonsense word I used actually existed in the web_md model!!) After studying the source (https://github.com/explosion/spaCy/blob/68900066e060b6e2fd7b74e343e6b4c93d8d96c2/spacy/tokens/token.pyx#L194) I've updated my answer. – Darren Cook May 02 '19 at 10:31
  • @DarrenCook How do I suppress the warning? I am facing the same problem. Where should the suggested change be made? I could find the code in errors.py (link here: github.com/explosion/spaCy/blob/…) but I am not sure where exactly to set "SPACY_WARNING_IGNORE=W008". I am using Windows 10, IDE: Spyder, spaCy 2.2.5. It would be great if you could tell me how and where to set the environment variable "SPACY_WARNING_IGNORE=W008". – Ridhima Kumar Dec 15 '19 at 11:36
  • @RidhimaKumar See https://docs.python.org/3/library/os.html#os.environ I *think* you will need to do that before you import spacy. Or, to set it outside the script I googled "windows python how to set environmental variable" and it found various answers. – Darren Cook Dec 15 '19 at 12:35
  • @DarrenCook Thanks for your reply. Sorry for seeking further clarification. "I think you will need to do that before you import spacy" so if I am inferring it correctly, the environment variable 'SPACY_WARNING_IGNORE=W008' can be set in my python script itself ? (i.e the script having the above similarity function) ? – Ridhima Kumar Dec 15 '19 at 16:05
  • @RidhimaKumar One of the top Google hits I got was this StackOverflow page: https://stackoverflow.com/questions/5971312/how-to-set-environment-variables-in-python (which was what led me to that manual page :-) ) (P.S. if that isn't clear, then yes I think it can all be done from inside the python script. I've not personally tried, though.) – Darren Cook Dec 15 '19 at 16:27
  • @DarrenCook Thank you I got it . – Ridhima Kumar Dec 15 '19 at 17:23

I suppressed the W008 warning by setting an environment variable, using this code in the run file.

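import os

from flask import Flask  # import was missing from the original snippet

app = Flask(__name__)

app.config['SPACY_WARNING_IGNORE'] = "W008"
# The environment variable probably needs to be set before spaCy is imported.
os.environ["SPACY_WARNING_IGNORE"] = "W008"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)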
Ferdous Wahid
  • As of spacy 2.3, you should use the standard warnings module from python to filter out warnings: `warnings.filterwarnings("ignore", message=r"\[W008\]", category=UserWarning)`. As described in the migration guide: https://spacy.io/usage/v2-3 – tupui Nov 16 '20 at 08:30
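
The warnings-based filter from the comment above can be tried out with the standard library alone. This is only a sketch: the names ignore_w008 and demo are hypothetical, and the W008 warning here is a hand-written stand-in with the same message text as in the question, so the snippet runs without spaCy installed.

```python
import warnings

# Regular expression matched against the start of the warning message.
W008_PATTERN = r"\[W008\]"


def ignore_w008():
    """Install a filter that hides only UserWarnings starting with [W008]."""
    warnings.filterwarnings("ignore", message=W008_PATTERN, category=UserWarning)


def demo():
    """Emit a stand-in W008 plus an unrelated warning; return what gets through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # start from "show everything"
        ignore_w008()                    # then hide W008 only
        warnings.warn(
            "[W008] Evaluating Doc.similarity based on empty vectors.",
            UserWarning,
        )
        warnings.warn("some other warning", UserWarning)
    return [str(w.message) for w in caught]


print(demo())  # ['some other warning']
```

In a real script, calling ignore_w008() (or the filterwarnings line directly) once near the top should be enough; other warnings stay visible.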