I work on an NLP task in Perl and I need to lemmatize tons of tokens from raw input text files ranging from 10MB to 300MB, so I decided to use Inline::Python with spaCy. The problem is that it's very slow. After lemmatization, I build bags of words and feed them to a cosine-similarity module to classify texts against past years. Is there a way to process this faster (multi-processing, multi-threading), or is it the pipe to Python that is slow? My machine has an i9, 64GB of RAM, an RTX 2080 Ti and an NVMe SSD.
Here is the piece of code that lemmatizes French text content and filters out stop words:
use Inline Python => <<'END_OF_PYTHON';
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
# load the French model once, when Inline::Python compiles this block
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 40000000  # allow very large docs (default is 1,000,000)

def lemmatizer(words):
    doc = nlp(words)
    # fr_stop is a set: test membership directly instead of rebuilding
    # a list from it for every token
    return [token.lemma_ for token in doc if token.lemma_ not in fr_stop]
END_OF_PYTHON
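Would chunking plus nlp.pipe help? This is the direction I was considering: feed spaCy a list of manageable chunks from Perl instead of one huge string, with the parser and NER disabled since only lemmas are needed. It's just a sketch, not tested at scale; lemmatize_many, the batch size and the process count are my own placeholder names/values, and the n_process argument needs spaCy >= 2.2:

use Inline Python => <<'END_OF_PYTHON';
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop

# parser and NER are not needed for lemmas; disabling them speeds up nlp()
nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])

def lemmatize_many(texts):
    # texts: a list of paragraph- or line-sized strings, not one giant doc;
    # batch_size and n_process are tuning knobs, not magic values
    out = []
    for doc in nlp.pipe(texts, batch_size=256, n_process=4):
        out.append([t.lemma_ for t in doc if t.lemma_ not in fr_stop])
    return out
END_OF_PYTHON

The idea is that spaCy can then batch and parallelize internally, instead of tokenizing one 300MB string in a single pass.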
Unfortunately, there is no good French lemmatizer in Perl, and lemmatization improves my accuracy at classifying text files into the right categories by 5%. That matters when you already have 90% good results without it. After this block I only call the lemmatizer function from Perl, so I don't reload the French spaCy model each time (I think?).
I thought about creating one thread per file: I have 15 big text files to lemmatize, one per category from the recent years (see the sketch below). But IMO the I/O is the problem. Do you have some ideas? I can't show more code because there are 1500 lines. Automatic classification takes 1000 seconds for the smallest category (50-60 files from the current year), and the biggest category is 10x bigger than the smallest.
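For the one-worker-per-file idea, here is roughly what I sketched on the Perl side with Parallel::ForkManager. It assumes Linux copy-on-write fork, so each child inherits the already-loaded spaCy model, and that Inline::Python hands the Python list back as an array ref; the file glob and worker count are placeholders:

use strict;
use warnings;
use Parallel::ForkManager;

my @files = glob('corpus/*.txt');          # placeholder: the 15 category files
my $pm    = Parallel::ForkManager->new(8); # keep below the physical core count

for my $file (@files) {
    $pm->start and next;                   # parent: move on to the next file
    # child: slurp, lemmatize, write its own output, then exit
    my $text = do { local $/; open my $fh, '<:utf8', $file or die $!; <$fh> };
    my $lemmas = lemmatizer($text);        # the Inline::Python function above
    open my $out, '>:utf8', "$file.lemmas" or die $!;
    print {$out} join("\n", @$lemmas), "\n";
    close $out;
    $pm->finish;                           # child exits here
}
$pm->wait_all_children;

Each child writes its own output file, so there is no shared state to lock; but if the disk is really the bottleneck, forking more workers won't help.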