I have a simple DataFrame and I am applying lemmatization to it.
Some words remain unchanged, so I am looking for a smart way to customize the lemmatization.
import pandas as pd
samples = ["mike discussed the project",
"kate visited jack",
"tom received greetings",
"let them discuss",
"regular visits"]
train = pd.DataFrame(samples)
train[0] = train[0].apply(lambda x: " ".join(word.lower() for word in x.split()))
# Lemmatization
from textblob import Word
train[0] = train[0].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
freq = pd.Series(' '.join(train[0]).split()).value_counts()
print(freq.to_string())
The output is:
kate 1
them 1
the 1
visited 1
visit 1
tom 1
jack 1
let 1
regular 1
project 1
greeting 1
discussed 1
discus 1
mike 1
received 1
Some words remained unchanged: visited, discussed, and received (and "discuss" was changed to "discus").
I can add the lines below before lemmatization, but is there a better way? Can the lemmatization be customized?
# train[0] = train[0].str.replace('discussed', 'discuss')
# train[0] = train[0].str.replace('visited', 'visit')
# train[0] = train[0].str.replace('received', 'receive')
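To make the idea concrete, the manual replacements above could be collected into one lookup table that is checked before the lemmatizer runs. This is only a rough sketch of what I mean by "customized": the names `CUSTOM_LEMMAS` and `lemmatize_with_overrides` are made up for illustration, and the `fallback` argument stands in for whatever lemmatizer is actually used.

```python
# Hypothetical override table: surface form -> lemma I actually want.
CUSTOM_LEMMAS = {
    "discussed": "discuss",
    "visited": "visit",
    "received": "receive",
    "discuss": "discuss",  # keep as-is so it is not turned into "discus"
}


def lemmatize_with_overrides(text, fallback=lambda word: word):
    """Lemmatize whole words, preferring CUSTOM_LEMMAS over the fallback.

    `fallback` is any callable mapping one word to its lemma; by default
    it returns the word unchanged.
    """
    lemmas = []
    for word in text.lower().split():
        if word in CUSTOM_LEMMAS:
            lemmas.append(CUSTOM_LEMMAS[word])
        else:
            lemmas.append(fallback(word))
    return " ".join(lemmas)
```

With TextBlob this would presumably be called as train[0].apply(lambda s: lemmatize_with_overrides(s, fallback=lambda w: Word(w).lemmatize())). Unlike str.replace, it matches whole words only, so a longer word that merely contains 'visited' would not be touched. It still means maintaining the table by hand, which is what I would like to avoid.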
By the way, I tried NLTK's WordNetLemmatizer and the result is the same. I also read "Python NLTK Lemmatization of the word 'further' with wordnet", but I still don't have a clue.