
I have a simple DataFrame and I am applying lemmatization to it.

Some words remain unchanged, so I am looking for a smart way to customize the lemmatization.

import pandas as pd

samples = ["mike discussed the project",
       "kate visited jack",
       "tom received greetings",
       "let them discuss",
       "regular visits"]

train = pd.DataFrame(samples)

# Lowercase every token (and normalize whitespace)
train[0] = train[0].apply(lambda x: " ".join(word.lower() for word in x.split()))

# Lemmatization
from textblob import Word
train[0] = train[0].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

freq = pd.Series(' '.join(train[0]).split()).value_counts()

print(freq.to_string())

The output is:

kate         1
them         1
the          1
visited      1
visit        1
tom          1
jack         1
let          1
regular      1
project      1
greeting     1
discussed    1
discus       1
mike         1
received     1

Some words remained unchanged: visited, discussed, received (and "discuss" was changed to "discus").

I can add the lines below before the lemmatization step.

But is there a better way? Can the lemmatization be customized?

# train[0] = train[0].str.replace('discussed', 'discuss')
# train[0] = train[0].str.replace('visited', 'visit')
# train[0] = train[0].str.replace('received', 'receive')
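
One way to generalize the replacements above, as a minimal sketch rather than a definitive solution: keep the exceptions in a dictionary and consult it per token before falling back to the lemmatizer. The names `OVERRIDES` and `lemmatize_text` are hypothetical, and the `lemmatize` argument is assumed to be any one-word callable (the default leaves words unchanged so the sketch runs without TextBlob).

```python
# Hypothetical override table: the entries are illustrative, not exhaustive.
OVERRIDES = {
    "discussed": "discuss",
    "visited": "visit",
    "received": "receive",
    "discuss": "discuss",  # keep as-is so it is not turned into "discus"
}


def lemmatize_text(text, lemmatize=lambda word: word, overrides=OVERRIDES):
    """Lemmatize each whitespace-separated token.

    A token found in `overrides` is mapped directly; any other token is
    passed to `lemmatize`, a caller-supplied function taking one word.
    """
    return " ".join(
        overrides[word] if word in overrides else lemmatize(word)
        for word in text.split()
    )


print(lemmatize_text("kate visited jack"))
```

With TextBlob, `lemmatize` could be `lambda word: Word(word).lemmatize()`, applied through `train[0].apply(lambda s: lemmatize_text(s, lemmatize=...))`. Because whole tokens are matched, a word such as "revisited" is left alone, whereas `str.replace('visited', 'visit')` would also rewrite it as a substring.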

By the way, I tried NLTK's WordNetLemmatizer and the result is the same. I also read "Python NLTK Lemmatization of the word 'further' with wordnet" but still don't have a clue.

Mark K
  • Out of context, the lemmatizer has no idea whether "discussed" is an adjective or a participle. Maybe you want to apply some sort of parsing to attempt to disambiguate these. – tripleee Jan 17 '19 at 07:56
  • @tripleee, thanks for the comment. Maybe on some occasions it's better not to use a lemmatizer. – Mark K Jan 17 '19 at 08:07
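
The disambiguation suggested in the comment could be sketched roughly as follows, assuming part-of-speech tags in Penn Treebank style are already available from some tagger. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, and `lemmatize` is assumed to be a caller-supplied function taking a word and a WordNet POS letter.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"  # adjectives
    if tag.startswith("RB"):
        return "r"  # adverbs
    return "n"      # everything else is treated as a noun


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs using lemmatize(word, wordnet_pos)."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Stand-in lemmatizer that only shows which POS each word would receive.
print(lemmatize_tagged([("kate", "NN"), ("visited", "VBD")],
                       lambda word, pos: word + "/" + pos))
```

With TextBlob, the pairs might come from `TextBlob(sentence).tags` and `lemmatize` could be `lambda word, pos: Word(word).lemmatize(pos)`. How reliably a tagger labels "discussed" or "visited" as verbs in short lowercase fragments like these is not guaranteed, so the override approach may still be needed for stubborn cases.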
