
I work on an NLP project and I need to lemmatize tons of tokens from raw input text files ranging from 10 MB to 300 MB, and I decided to use Inline::Python with spaCy for this task. The problem is that it's very slow. After this step, I create bags of words to feed into a cosine-similarity module that classifies texts from past years. Is there a way to process faster (multi-processing, multi-threading), or is it the pipe to Python that is slow? I have an i9, 64 GB RAM, an RTX 2080 Ti, and an SSD connected by NVMe.

Here is the piece of code that lemmatizes some French text content and filters stop words:

use Inline Python => <<'END_OF_PYTHON';

import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 40000000

def lemmatizer(words):
    doc = nlp(words)
    return list(filter(lambda x: x not in list(fr_stop), list(map(lambda token: token.lemma_ , doc))))

END_OF_PYTHON

Unfortunately, there is no good French lemmatizer in Perl, and lemmatization increases my accuracy at classifying text files into the right categories by 5%. That matters when you already get 90% good results without it. After this piece of code, I only call the lemmatizer function from Perl, so I don't reload the French spaCy model each time (I think?).

I thought about creating one thread per file. I have 15 big text files to lemmatize, one file per category from recent years. But in my opinion, the I/O is the problem. Do you have any ideas? I can't show more code because there are 1,500 lines. I need 1,000 seconds to run the automatic classification on the smallest category (50–60 files from the current year). The biggest is 10x bigger than the smallest.

    Adding parallelism to an I/O bound task will only make it slower, unless you have parallel disks with individual data buses. – tripleee May 05 '21 at 08:12
    I don't know enough about `Inline::Python` to say whether there is any easy way to fix this, but explicitly building a `list` of the lemmas causes everything to be stored in memory. Can you refactor this to a generator which `yield`s one token at a time, and then releases the memory? – tripleee May 05 '21 at 08:15
  • It's a good idea, but stupid question: how do I get my lemmatized words back when I have a generator from the Python filter function? In Perl? It's the first time I've mixed Python with Perl, and it's pretty hard to understand, to be honest. I know there are generators in Perl. Should I use them? – Etienne Armangau May 05 '21 at 16:58
  • How much actual text do you have, in words or MB? – polm23 May 06 '21 at 11:33
  • Between 450 and 500 MB – Etienne Armangau May 06 '21 at 13:09

1 Answer


There are a number of speed improvements that you could try:

  1. Use yield (actually yield from) instead of constructing the whole list in memory before returning it. Also, I don't think you need to create a list from the results of map:
def lemmatizer(words):
    doc = nlp(words)
    yield from filter(lambda x: x not in list(fr_stop), map(lambda token: token.lemma_, doc))
  2. Use a set instead of a list for containment checking:
fr_stop = set(fr_stop)
def lemmatizer(words):
    doc = nlp(words)
    yield from filter(lambda x: x not in fr_stop, map(lambda token: token.lemma_, doc))
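To see why point 2 matters: checking `x not in some_list` scans the list element by element, while a set lookup is a hash probe. A rough, self-contained timing sketch (the stop words here are made-up stand-ins, not the real fr_stop):

```python
import timeit

# Stand-in stop-word collections; the real STOP_WORDS has a similar size.
stop_list = ["mot%d" % i for i in range(500)]  # list lookup: O(n) scan
stop_set = set(stop_list)                      # set lookup: O(1) on average

tokens = ["mot499", "bonjour", "mot0", "chat"] * 500

t_list = timeit.timeit(lambda: [t for t in tokens if t not in stop_list], number=20)
t_set = timeit.timeit(lambda: [t for t in tokens if t not in stop_set], number=20)
# On typical hardware the set version is faster by a large factor.
```

With millions of tokens and a few hundred stop words, this difference alone can dominate the post-processing time.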

These should help reduce both processing time and memory pressure.
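A follow-up question in the comments is how to get a plain list or string back out of the generator. On the Python side this is just list() or str.join(); a minimal sketch with a stand-in lemmatizer (whitespace splitting instead of spaCy, so it runs without the model):

```python
def lemmatizer(words):
    # Stand-in for the spaCy version: lowercased whitespace tokens.
    yield from (w.lower() for w in words.split())

gen = lemmatizer("Le Chat Dort")
lemmas = list(gen)        # drain the generator into a list
text = " ".join(lemmas)   # or join it into a single string
print(text)  # le chat dort
```

From Perl, Inline::Python will typically hand a generator back as an opaque Python object, so materializing it into a list or joined string inside Python before returning, as above, is the simplest way to end up with a plain Perl array or scalar.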

  • Thanks, I'll try your solution with yield and come back to you. But is multi-threading or multiprocessing the solution in any case? – Etienne Armangau May 05 '21 at 14:30
  • First measure the single-threaded performance to know whether you are I/O-bound with multi-threading (in that case performance would not scale close to linearly with the number of cores). – sophros May 05 '21 at 15:26
  • Also, if you like the answer, please accept it. – sophros May 05 '21 at 15:26
  • I did. How do I convert the generator object to a string? Should I use Perl generators too? – Etienne Armangau May 05 '21 at 17:00
  • Why the question? You may want to consult some SO answers on a similar matter, like [this one](https://stackoverflow.com/questions/2419770/how-to-get-one-value-at-a-time-from-a-generator-function-in-python) – sophros May 05 '21 at 18:48