
I added lemmatization to my CountVectorizer, as explained on this sklearn documentation page.

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
                       strip_accents = 'unicode',
                       stop_words = 'english',
                       lowercase = True,
                       token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                       max_df = 0.5,
                       min_df = 10)

However, when creating a dtm using fit_transform, I get the error below, which I can't make sense of. The dtm code always worked before I added the lemmatization to my vectorizer. I dug deeper into the documentation and tried a few things with the code, but couldn't find a solution.

dtm_tf = tf_vectorizer.fit_transform(articles)

Update:

After following @MaxU's advice below, the code ran without error; however, numbers and punctuation were not omitted from my output. I ran individual tests to see which of the other parameters do and do not work in combination with LemmaTokenizer(). Here is the result:

strip_accents = 'unicode', # works
stop_words = 'english', # works
lowercase = True, # works
token_pattern = r'\b[a-zA-Z]{3,}\b', # does not work
max_df = 0.5, # works
min_df = 10 # works

Apparently, it is just token_pattern that became inactive: sklearn ignores token_pattern whenever a custom tokenizer is supplied. Here is the updated and working code without token_pattern (I just needed to download the NLTK 'punkt' and 'wordnet' data first):

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents = 'unicode', # works 
                                stop_words = 'english', # works
                                lowercase = True, # works
                                max_df = 0.5, # works
                                min_df = 10) # works

For those who want to remove digits, punctuation and words of fewer than 3 characters (but are not sure how), here is one way that does it for me when working from a Pandas DataFrame:

df['TEXT'] = df['TEXT'].str.replace(r'\d+', '', regex=True)          # remove digits
df['TEXT'] = df['TEXT'].str.replace(r'\b\w{1,2}\b', '', regex=True)  # remove words of 1-2 characters
df['TEXT'] = df['TEXT'].str.replace(r'[^\w\s]', '', regex=True)      # remove punctuation
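To see the effect, here is a small illustrative check on a made-up one-row DataFrame (the sample text is just an assumption for demonstration):

import pandas as pd

df = pd.DataFrame({'TEXT': ['In 2020, AI got 3x better!!']})
df['TEXT'] = df['TEXT'].str.replace(r'\d+', '', regex=True)          # digits
df['TEXT'] = df['TEXT'].str.replace(r'\b\w{1,2}\b', '', regex=True)  # words of 1-2 characters
df['TEXT'] = df['TEXT'].str.replace(r'[^\w\s]', '', regex=True)      # punctuation
print(df['TEXT'].iloc[0])  # digits, short words and punctuation are gone; extra spaces remain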
Rens
  • I don't know the answer to this question. But the example from sklearn seems sloppy: a lemmatizer needs a part-of-speech tag to work correctly. This is usually inferred with nltk's pos_tag function on the tokens before lemmatizing. – Luv Nov 26 '18 at 11:33
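For reference, a minimal sketch of such a POS-aware tokenizer could look like the following (the Penn-Treebank-to-WordNet mapping is an illustration, not part of the original question, and it needs the 'averaged_perceptron_tagger' NLTK data):

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # rough mapping from Penn Treebank tags to WordNet POS constants
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # WordNetLemmatizer defaults to nouns anyway

class PosLemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # tag the tokens first, then lemmatize each token with its mapped POS
        return [self.wnl.lemmatize(token, penn_to_wordnet(tag))
                for token, tag in pos_tag(word_tokenize(doc))]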

2 Answers


It should be:

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
# NOTE:                        ---------------------->  ^^

instead of:

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
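The tokenizer parameter expects a callable that takes one raw document and returns a list of tokens. Passing the class itself means CountVectorizer ends up calling LemmaTokenizer(doc), which fails because __init__ does not accept a document argument; an instance, by contrast, tokenizes as intended. A quick check in isolation (output is approximate):

lt = LemmaTokenizer()         # the instance is the callable CountVectorizer needs
lt("The cats are running")    # -> ['The', 'cat', 'are', 'running'] (default noun lemmatization)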
MaxU - stand with Ukraine
  • Thanks for your advice, the code indeed runs without error. However, the parameters after `LemmaTokenizer()` do not work anymore. Most importantly, `token_pattern = r'\b[a-zA-Z]{3,}\b'` became inactive (so my topics are full of numbers and punctuation). Is it possible to integrate everything in one step? Or shall I separate the two (and remove numbers and punctuation beforehand)? – Rens Nov 22 '17 at 00:38
  • @Rens, please open a new question and provide there a small (3-5 rows) reproducible sample data set and your code – MaxU - stand with Ukraine Nov 22 '17 at 07:27

Thanks for the code, it helped me. Here is another way to deal with the inactive token_pattern:

import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # drop tokens containing digits or punctuation, and tokens shorter than 3 characters
        regex_num_ponctuation = r'(\d+)|([^\w\s])'
        regex_little_words = r'(\b\w{1,2}\b)'
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)
                if not re.search(regex_num_ponctuation, t) and not re.search(regex_little_words, t)]

The regex filtering is built directly into the LemmaTokenizer class.
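As a usage sketch (parameter values copied from the question; token_pattern is left out because it is ignored when a custom tokenizer is supplied):

from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents='unicode',
                                stop_words='english',
                                lowercase=True,
                                max_df=0.5,
                                min_df=10)
dtm_tf = tf_vectorizer.fit_transform(articles)  # 'articles' is the corpus from the question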

Lucho
  • That is a nice added option, and of course the preferred way to go, so thanks. Just a side note: for my topic models, I ended up not using the lemmatizer after all, as it produced less satisfying results. Also see: https://mimno.infosci.cornell.edu/papers/schofield_tacl_2016.pdf – Rens Dec 26 '21 at 08:08