
I am following this document clustering tutorial. As input I give a txt file, which can be downloaded here. It is a combination of 3 other txt files, joined with \n. After creating the tf-idf matrix I received this warning:

UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.

I guess it has something to do with the order of lemmatization/stemming and stop word removal, but as this is my first text-processing project I am a bit lost and don't know how to fix this...

import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer


stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens


totalvocab_stemmed = []
totalvocab_tokenized = []
with open('shortResultList.txt', encoding="utf8") as synopses:
    for i in synopses:
        allwords_stemmed = tokenize_and_stem(i)  # for each item in 'synopses', tokenize/stem
        totalvocab_stemmed.extend(allwords_stemmed)  # extend the 'totalvocab_stemmed' list
        allwords_tokenized = tokenize_only(i)
        totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print ('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print (vocab_frame.head())

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

with open('shortResultList.txt', encoding="utf8") as synopses:
    tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

print(tfidf_matrix.shape)
Dima Lituiev

4 Answers


The warning is trying to tell you that if your text contains "always" it will be normalised to "alway" before matching against your stop list which includes "always" but not "alway". So it won't be removed from your bag of words.

The solution is to preprocess your stop list so that it is normalised the same way your tokens will be, and to pass the list of normalised words as stop_words to the vectoriser.
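
For example, a rough sketch of that approach, assuming the `tokenize_and_stem` function from the question is in scope and starting from scikit-learn's built-in English stop list (the list the warning is generated against):

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Normalise the stop list with the same function the vectoriser will apply to
# the documents, so a stemmed token like 'alway' matches a stemmed stop word.
stemmed_stop_words = sorted({stem
                             for word in ENGLISH_STOP_WORDS
                             for stem in tokenize_and_stem(word)})

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=stemmed_stop_words,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

Note that stemming is not idempotent (see the comments below: `please` → `pleas` → `plea`), so a handful of words may still be flagged; those stragglers can simply be appended to the list by hand.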

joeln
  • It's quite an annoying task, t.b.h. If I define `please` as a stop word, the vectoriser complains because it gets tokenized to `pleas`. If I pass `pleas` (i.e. the tokenized stop word), the vectoriser complains because it gets tokenized to `plea`. – MERose Apr 20 '20 at 17:48
  • @MERose, I'm still not convinced that there's a best way to deal with this. Different packages have [different stop word lists](https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47) and tokenization/lemmatization methods. There's a workaround, usable with or without a stop word list: corpus-based stop words, which you can control with `max_df` and `min_df` in `CountVectorizer()` and `TfidfVectorizer()` (see the sketch after these comments). See **stop_words** [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). – andreassot10 Nov 18 '20 at 10:22
  • In order to tokenize the stop words: `tokenizer = TfidfVectorizer().build_tokenizer()` and then `my_stop_words = sum([tokenizer(stop_word) for stop_word in my_stop_words], [])` – Alaa M. Mar 07 '23 at 12:09
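
A rough sketch of that corpus-based workaround, reusing the `max_df`/`min_df` values and the `tokenize_and_stem` tokenizer from the question and passing no stop list at all, so the consistency check never runs:

from sklearn.feature_extraction.text import TfidfVectorizer

# No explicit stop_words: terms occurring in more than 80% of the documents
# (max_df) or in fewer than 20% of them (min_df) are pruned, which removes
# most function words from a reasonably sized corpus.
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.2,
                                   max_features=200000, use_idf=True,
                                   tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))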

I had the same problem and for me the following worked:

  1. include the stop words in the tokenize function, and then
  2. remove the stop_words parameter from TfidfVectorizer

Like so:

1.

stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)

    #exclude stopwords from stemmed words
    stems = [stemmer.stem(t) for t in filtered_tokens if t not in stopwords]

    return stems
2. Delete the stop_words parameter from the vectorizer:
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8, max_features=200000, min_df=0.2,
    use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3)
)
Suki
  • I tested your code but tbh it ain't that good. It considered "Khadse" as "khads". – edusanketdk Jun 17 '21 at 13:33
  • @edusanketdk My code provides a working solution to the question asked. The problem you're describing is a different issue, not caused by my code but mostly by the functionality of the library. Please note that my answer is explicitly tailored to "english" as a language, and "Khadse" is not a native English word. – Suki Jun 19 '21 at 12:09
  • Indeed. But the data is always going to contain nouns and names that are not in the English vocabulary, so a robust model is expected not to fail like that. Also, I understand it's the library's problem, so we should not use it, or at least use it with some optimizations. Right? – edusanketdk Jun 20 '21 at 15:01
  • @edusanketdk I suppose in this case you could manually extend the stopword list to include frequent non-English words. From what I've seen, the nltk library is still one of the widest-used and best-working ones out there. Feel free to add another answer though with a better library if you know of one. – Suki Aug 12 '21 at 13:21
  • I think there is a typo in the code. In the first code snippet it should be: `stems = [stemmer.stem(t) for t in filtered_tokens if t not in stopwords]` – Gaspar Avit Ferrero Nov 02 '21 at 11:47
  • @GasparAvitFerrero Yes, I think you might be right, well spotted. – Suki Nov 02 '21 at 15:55

I faced this problem because of the PT-BR (Brazilian Portuguese) language.

TL;DR: Remove the accents from your language's text.

# Special thanks to the user Humberto Diogenes from the Python mailing list (answer from Aug 11, 2008)
# Link: http://python.6.x6.nabble.com/O-jeito-mais-rapido-de-remover-acentos-de-uma-string-td2041508.html

# I found the issue by chance (I swear, haha) but this guy gave the tip before me
# Link: https://github.com/scikit-learn/scikit-learn/issues/12897#issuecomment-518644215

from unicodedata import normalize  # needed for the NFKD accent stripping below
import spacy

nlp = spacy.load('pt_core_news_sm')

# Define default stopwords list
stoplist = spacy.lang.pt.stop_words.STOP_WORDS

def replace_ptbr_char_by_word(word):
    """Strip accents from a single token."""
    word = str(word)
    word = normalize('NFKD', word).encode('ASCII', 'ignore').decode('ASCII')
    return word

def remove_pt_br_char_by_text(text):
    """Strip accents across the entire text, dropping stop words along the way."""
    text = str(text)
    text = " ".join(replace_ptbr_char_by_word(word) for word in text.split() if word not in stoplist)
    return text

df['text'] = df['text'].apply(remove_pt_br_char_by_text)

I put the solution and references in this gist.

Flavio

Manually adding those (already stemmed) words to the stop_words list can solve the problem.

from stop_words import safe_get_stop_words  # from the `stop-words` package

stop_words = safe_get_stop_words('en')
stop_words.extend(['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'])
user67275