
I get a UserWarning every time I execute this function. Here user_input is a list of words, and article_sentences is a list of lists of words.

I've tried removing all stop words from the list beforehand, but this didn't change anything.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# article_sentences (a list of tokenized sentences) and get_processed_text
# are defined at module level
def generate_response(user_input):
    sidekick_response = ''
    article_sentences.append(user_input)

    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')
    all_word_vectors = word_vectorizer.fit_transform(article_sentences)  # this is the problematic line
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)
    similar_sentence_number = similar_vector_values.argsort()[0][-2]

This is part of a function for a simple chatbot I found here: https://stackabuse.com/python-for-nlp-creating-a-rule-based-chatbot/. It should return a list of sentences sorted by how well they match the user_input, which it does, but it also throws this UserWarning:

UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
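For reference, get_processed_text follows the tutorial and lemmatizes with NLTK's WordNetLemmatizer. Roughly (simplified sketch):

import string
import nltk  # nltk.download('punkt') and nltk.download('wordnet') are needed once

wnlemmatizer = nltk.stem.WordNetLemmatizer()
punctuation_removal = dict((ord(p), None) for p in string.punctuation)

def get_processed_text(document):
    # lowercase, strip punctuation, tokenize, then lemmatize each token
    tokens = nltk.word_tokenize(document.lower().translate(punctuation_removal))
    return [wnlemmatizer.lemmatize(token) for token in tokens]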

MERose
Fractal Salamander
  • This is essentially a duplicate of https://stackoverflow.com/questions/57340142/user-warning-your-stop-words-may-be-inconsistent-with-your-preprocessing where the question is more easily answered. – joeln Aug 07 '19 at 11:20

2 Answers


There seems to be an issue with preprocessing.

In my experience, the stemming step in preprocessing produces truncated stems, for example stripping ing from financing and keeping the stem financ. These stems then carry forward and end up inconsistent with the stop_words list used by TfidfVectorizer.

See this post for more information: Python stemmer issue: wrong stem

You can also try skipping the stemming step and only tokenizing. That will at least get rid of the inconsistency warning; a minimal sketch is below.
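This sketch assumes an NLTK-based tokenizer; tokenize_only is just an illustrative name, not something from the question:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize_only(document):
    # lowercase and tokenize, but do not stem or lemmatize,
    # so the tokens stay consistent with the built-in English stop list
    return nltk.word_tokenize(document.lower())

word_vectorizer = TfidfVectorizer(tokenizer=tokenize_only, stop_words='english')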

sheth7

This UserWarning has been discussed here. As @jnothman says:

...make sure that you preprocess your stop list to make sure that it is normalised like your tokens will be, and pass the list of normalised words as stop_words to the vectoriser.
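A minimal sketch of that suggestion, assuming the lemmatizing tokenizer from the question (a simplified get_processed_text is included here for completeness): run scikit-learn's built-in English stop list through the same normalisation and pass the result as stop_words.

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

wnlemmatizer = nltk.stem.WordNetLemmatizer()

def get_processed_text(document):
    # simplified stand-in for the question's lemmatizing tokenizer
    return [wnlemmatizer.lemmatize(t) for t in nltk.word_tokenize(document.lower())]

# Normalise the built-in stop list with the same tokenizer, then pass the
# normalised words explicitly so stop_words and tokens agree.
normalised_stop_words = set()
for stop_word in ENGLISH_STOP_WORDS:
    normalised_stop_words.update(get_processed_text(stop_word))

word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text,
                                  stop_words=list(normalised_stop_words))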

BringBackCommodore64