I am loading a spaCy model in order to lemmatize a large collection of documents (over 1000). To speed the lemmatization up, I use GNU Parallel to run my lemmatization script on many documents at once. However, loading the spaCy model is a very costly step, and every process pays that cost again; ideally I would like to load the model once and share it across all the processes. The load call and the command I run are shown below.

The following questions are similar to mine, but neither has a conclusive answer:

Sharing shared object between multiple processes

Would it be possible to share some memory with GNU Parallel?

nlp = spacy.load('en', disable=['parser', 'ner'])
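
The script itself is essentially the sketch below (stripped down; I am assuming here that it receives one .gz path as its argument, which is what the nested parallel invocation in the next command passes it, and the .lemmas output file is just for illustration):

import gzip
import sys

import spacy

# The costly step: this load is repeated in every parallel invocation.
nlp = spacy.load('en', disable=['parser', 'ner'])

path = sys.argv[1]
with gzip.open(path, 'rt') as fin, open(path + '.lemmas', 'w') as fout:
    for line in fin:
        # Each line is treated as one post here; any JSON parsing the
        # real script does is omitted from this sketch.
        doc = nlp(line.strip())
        fout.write(' '.join(token.lemma_ for token in doc) + '\n')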

This is the command I use to run the script over the corpus:

ls -d -1 /home/ndg/arc/reddit/2015/RC_2015-[0][1-5]*.gz | parallel -j20 --pipe parallel -j100 --no-notice python lemmatize_subreddit_posts.py
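
To make the goal concrete: what I want is something like the sketch below, where the model is loaded once in the parent and, on Linux, fork-based multiprocessing lets every worker reuse it via copy-on-write instead of reloading it. (This is a hypothetical Python-only sketch that drops GNU Parallel; lemmatize_file and the .lemmas output are made up for illustration.)

import glob
import gzip
import multiprocessing as mp

import spacy

# Loaded once in the parent, before any worker is forked.
nlp = spacy.load('en', disable=['parser', 'ner'])

def lemmatize_file(path):
    # Forked workers inherit `nlp` copy-on-write, so there is no reload.
    with gzip.open(path, 'rt') as fin, open(path + '.lemmas', 'w') as fout:
        for line in fin:
            doc = nlp(line.strip())
            fout.write(' '.join(token.lemma_ for token in doc) + '\n')
    return path

if __name__ == '__main__':
    files = glob.glob('/home/ndg/arc/reddit/2015/RC_2015-[0][1-5]*.gz')
    with mp.Pool(processes=20) as pool:
        for done in pool.imap_unordered(lemmatize_file, files):
            print('finished', done)

Is there a way to get this load-once behaviour while still driving the jobs with GNU Parallel?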