I have the following problem.
My goal is to process a bunch of documents, bringing every word to its normal form (e.g. 'was' --> 'be', 'were' --> 'be', 'went' --> 'go').
In other words, I need to open each file in a directory, change its content, and save the result to another directory.
Since the process is time-consuming, I decided to parallelize it with joblib.
The code below works properly (I mean, it does what it's supposed to), but I've run into a huge problem with memory: usage keeps growing constantly until there's no memory left on the server at all.
from joblib import delayed, Parallel


def process_text(text):
    # some function which processes
    # text and returns a new text
    return processed_text


def process_and_save(document_id):
    with open(path + document_id) as f:
        text = f.read()
    text = process_text(text)
    f = open(other_path + document_id, 'w')
    f.write(text)
    f.close()


all_doc_ids = # a list of document ids which I need to process

Parallel(n_jobs=10)(delayed(process_and_save)(doc_id) for doc_id in all_doc_ids)
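To give an idea of what process_text does, here is a rough sketch of that kind of normalization using NLTK's WordNetLemmatizer (illustrative only, not my actual code, which is more involved):

# Illustrative sketch only, not the real process_text: normalizes verbs
# with NLTK's WordNetLemmatizer ('was' -> 'be', 'went' -> 'go').
# Requires nltk.download('wordnet') to have been run beforehand.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def process_text(text):
    return ' '.join(lemmatizer.lemmatize(word, pos='v') for word in text.split())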
I've also tried replacing joblib with multiprocessing:
from multiprocessing import Pool

pool = Pool(10)
pool.map(process_and_save, all_doc_ids)
But the situation turned out to be exactly the same.
Is there any way to solve this problem? And, of course, the main question: why is this happening at all?
Thank you!
P.S. The documents are quite small, and the process consumes very little memory when run without parallelism.
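For reference, this is the kind of check I mean by "very little memory": watching resident memory during a plain sequential run, sketched here with psutil (names and the exact check are illustrative):

import os
import psutil

def rss_mb():
    # resident set size of the current process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

# sequential run for comparison -- memory stays low here
for doc_id in all_doc_ids:
    process_and_save(doc_id)
    print(rss_mb())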