I am trying to clean some text data. First I removed the stop words, then I tried to lemmatize the text. However, some words, such as nouns, are removed during lemmatization.
Sample data:
https://drive.google.com/file/d/1p9SKWLSVYeNScOCU_pEu7A08jbP-50oZ/view?usp=sharing
Updated code:
# Libraries
import spacy
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['covid', 'COVID-19', 'coronavirus'])
article= pd.read_csv("testdata.csv")
data = article.title.values.tolist()
nlp = spacy.load('en_core_web_sm')
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes, dropping punctuation; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
data_words = list(sent_to_words(data))
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
print ("*** Text After removing Stop words: ")
print(data_words_nostops)
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'PRON']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # Keep only the lemmas of tokens whose POS tag is in allowed_postags
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV','PRON'])
print ("*** Text After Lemmatization: ")
print(data_lemmatized)
The output after removing stop words is:
[['qaia', 'flags', 'amman', 'melbourne', 'jetstar', 'flights', 'recovery', 'plan'],
 ['western', 'amman', 'suburb', 'new', 'nsw', 'ground', 'zero', 'children'],
 ['flight', 'returned', 'amman', 'qaia', 'staff', 'contract', 'driving']]
The output after lemmatization is:
[['flight', 'recovery', 'plan'],
 ['suburb', 'ground'],
 ['return', 'contract', 'driving']]
For each record, I do not understand the following:
- 1st record: why were these words removed: 'qaia', 'flags', 'amman', 'melbourne', 'jetstar'?
- 2nd record: essential words are removed, as in the first record. Also, I was expecting "children" to be converted to "child".
- 3rd record: "driving" is not converted to "drive".
I was expecting that words such as "Amman" would not be removed. I was also expecting words to be converted from plural to singular, and verbs to be converted to the infinitive.
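To narrow this down, here is a minimal sketch of the filtering step from `lemmatization` in isolation, without spaCy. The `(text, lemma, pos)` triples, the `example_tokens` name, and the `filter_by_pos` helper are hypothetical and hand-written for illustration; they are not real output from `en_core_web_sm`, and the actual tags depend on the model and context. It only shows that any token whose tag falls outside `allowed_postags` (for example `PROPN`, which is not in my list) is dropped by the list comprehension, whatever its lemma is.

```python
# Hypothetical (text, lemma, pos) triples, hand-written for illustration only.
# They are NOT real spaCy output; actual tags depend on the model and context.
ALLOWED_POSTAGS = {'NOUN', 'ADJ', 'VERB', 'ADV', 'PRON'}

example_tokens = [
    ('amman', 'amman', 'PROPN'),     # assumed here: tagged as a proper noun
    ('flights', 'flight', 'NOUN'),
    ('recovery', 'recovery', 'NOUN'),
    ('plan', 'plan', 'NOUN'),
]


def filter_by_pos(tokens, allowed=ALLOWED_POSTAGS):
    """Split tokens into kept lemmas and dropped (text, pos) pairs."""
    kept = [lemma for text, lemma, pos in tokens if pos in allowed]
    dropped = [(text, pos) for text, lemma, pos in tokens if pos not in allowed]
    return kept, dropped


kept, dropped = filter_by_pos(example_tokens)
print(kept)     # ['flight', 'recovery', 'plan']
print(dropped)  # [('amman', 'PROPN')]
```

Printing `token.text`, `token.pos_` and `token.lemma_` for every token of the real `doc` would show which tags the model actually assigns to the missing words.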
What am I missing here? Thanks in advance.