
I am trying to clean some text data. First I removed the stop words, then I tried to lemmatize the text, but some words, such as nouns, are being removed.

Sample Data

https://drive.google.com/file/d/1p9SKWLSVYeNScOCU_pEu7A08jbP-50oZ/view?usp=sharing

Updated code:

# Libraries  
import spacy
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['covid', 'COVID-19', 'coronavirus'])
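# note: the stop-word filter below compares lower-cased tokens from simple_preprocess,
# so the mixed-case 'COVID-19' entry above will never match a token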

article= pd.read_csv("testdata.csv")
data = article.title.values.tolist()
nlp = spacy.load('en_core_web_sm')

def sent_to_words(sentences):
    for sentence in sentences:
      yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; simple_preprocess itself lower-cases and removes punctuation

data_words = list(sent_to_words(data))

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
print ("*** Text  After removing Stop words:   ")
print(data_words_nostops)
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV','PRON']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV','PRON'])
print ("*** Text  After Lemmatization:   ")

print(data_lemmatized)

The output after removing stop words is:

[['qaia', 'flags', 'amman', 'melbourne', 'jetstar', 'flights', 'recovery', 'plan'],
 ['western', 'amman', 'suburb', 'new', 'nsw', 'ground', 'zero', 'children'],
 ['flight', 'returned', 'amman', 'qaia', 'staff', 'contract', 'driving']]

The output after lemmatization is:

[['flight', 'recovery', 'plan'],
 ['suburb', 'ground'],
 ['return', 'contract', 'driving']]

For each record, I do not understand the following:

- 1st record: why are these words removed: 'qaia', 'flags', 'amman', 'melbourne', 'jetstar'?

- 2nd record: essential words are removed, same as in the first record. Also, I was expecting 'children' to be converted to 'child'.

- 3rd record: 'driving' is not converted to 'drive'.

I was expecting that words such as "Amman" would not be removed. I was also expecting plural words to be converted to singular, and verbs to be converted to the infinitive...

What am I missing here? Thanks in advance.

almegdadi
  • The words removed look like proper nouns. Try adding `PROPN` to your `allowed_postags`. Your expectations about lemmatization are correct; however, spaCy's lemmatizer isn't great. If you need better performance you can try [lemminflect](https://github.com/bjascob/LemmInflect) (see the sketch after these comments). – bivouac0 Nov 28 '20 at 00:12
  • BTW... I noticed that you're running the version of the sentence with stop words removed through spaCy's `nlp`. This could be messing up the assignment of POS tags, which will interfere with lemmatization, etc. Check the tags that spaCy assigns to your test sentences to see if they are right, and consider processing your full sentence through `nlp`. – bivouac0 Nov 28 '20 at 00:22
  • @bivouac0 Thank you for your comment. Regarding the stop words, I extended the English word list like this: `stop_words = stopwords.words('english'); stop_words.extend(['covid', 'COVID-19', 'coronavirus'])`, but I deactivated it because I wanted to check the behaviour of the lemmatizer. – almegdadi Nov 28 '20 at 07:18
  • @bivouac0 I added `PROPN` to the `allowed_postags`. This works great for words like **"Amman"** and **"flights"**... BUT words like **"children"** did not convert to **"child"**. – almegdadi Nov 28 '20 at 07:25
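A minimal sketch of the two suggestions above (the sample title is invented for illustration, and lemminflect is a separate install that registers a `._.lemma()` extension on spaCy tokens):

import spacy
import lemminflect  # importing registers the ._.lemma() token extension

nlp = spacy.load('en_core_web_sm')

# hypothetical original title vs. the token soup left after gensim + stop-word removal
full = "Western Amman suburb is the new NSW ground zero for children"
stripped = "western amman suburb new nsw ground zero children"

for text in (full, stripped):
    print(text)
    for token in nlp(text):
        # compare the POS tag and spaCy's lemma with lemminflect's lemma
        print(f"  {token.text:12} {token.pos_:6} {token.lemma_:12} {token._.lemma()}")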

1 Answer


I'm guessing that most of your issues are because you're not feeding spaCy full sentences and it's not assigning the correct part-of-speech tags to your words. This can cause the lemmatizer to return the wrong results. However, since you've only provided snippets of code and none of the original text, it's difficult to answer this question. Next time, consider boiling down your question to a few lines of code that someone else can run on their machine EXACTLY AS WRITTEN, and providing a sample input that fails. See [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example).

Here's an example that works and is close to what you're doing.

import spacy
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
allow_postags = set(['NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'])
nlp = spacy.load('en')
text = 'The children in Amman and Melbourne are too young to be driving.'
words = []
for token in nlp(text):
    if token.text not in stop_words and token.pos_ in allow_postags:
        words.append(token.lemma_)
print(' '.join(words))

This returns: `child Amman Melbourne young drive`

bivouac0
  • Can you please check my updated code, @bivouac0? – almegdadi Nov 28 '20 at 20:29
  • You still have the same issue: spaCy's lemmatizer doesn't work correctly after you run the text through gensim, because you remove stop words and punctuation and convert everything to lower-case. You need to run spaCy on the original sentence and then extract the words you want to keep (see the sketch after these comments). You shouldn't need gensim (or pandas) at all. BTW, I think you also want `PROPN`, not `PRON`, in your `allowed_postags`. – bivouac0 Nov 28 '20 at 20:58
  • For `spacy.load('en')`, it may be worth referring to this [post](https://stackoverflow.com/questions/49964028/spacy-oserror-cant-find-model-en). For example, `spacy.load('en_core_web_sm')` may be sufficient for some use cases. – Quetzalcoatl Sep 18 '21 at 22:19
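Putting the answer and these comments together, here's a minimal end-to-end sketch (reusing the file and column names from the question, and assuming the NLTK stop-word list is already downloaded) that runs spaCy on the raw titles and drops gensim entirely:

import pandas as pd
import spacy
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) | {'covid', 'coronavirus'}
allowed_postags = {'NOUN', 'ADJ', 'VERB', 'ADV', 'PROPN'}  # PROPN keeps proper nouns like 'Amman'
nlp = spacy.load('en_core_web_sm')

article = pd.read_csv('testdata.csv')  # same file as in the question
data_lemmatized = []
for title in article.title.astype(str):
    doc = nlp(title)  # the full, unmodified title, so the POS tags stay reliable
    data_lemmatized.append([token.lemma_ for token in doc
                            if token.pos_ in allowed_postags
                            and token.lemma_.lower() not in stop_words])
print(data_lemmatized)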