8

I have the following problem. I need to process a large number of documents, bringing every word to its normal form (e.g. 'was' --> 'be', 'were' --> 'be', 'went' --> 'go'). That means I need to open each file in a directory, change its content, and save it to another directory.
Since the process is time-consuming, I decided to parallelize it with joblib. The code below works correctly (it does what it is supposed to do), but I ran into a huge memory problem.
Memory usage keeps growing constantly!
It grows until there is no memory left on the server at all.

from joblib import delayed, Parallel

def process_text(text):
    # some function which processes
    # text and returns a new text
    return processed_text


def process_and_save(document_id):
    # read the original document, process it, and write the result to the other directory
    with open(path + document_id) as f:
        text = f.read()
    text = process_text(text)
    with open(other_path + document_id, 'w') as f:
        f.write(text)

all_doc_ids = # a list of document ids which I need to process

Parallel(n_jobs=10)(delayed(process_and_save)(doc_id) for doc_id in all_doc_ids)

I've also tried replacing joblib with multiprocessing:

from multiprocessing import Pool

pool = Pool(10)
pool.map(process_and_save, all_doc_ids)

But the situation turned out to be exactly the same.

Are there any ways to solve the problem? And, of course, the main question is, why is this even happening?

Thank you!

P.S. The documents are quite small and the process consumes very little memory when running without parallelism.

fremorie
  • For multiprocessing you can explicitly terminate all the spawned processes. For joblib I have the same problem – Ivan Sudos May 11 '21 at 22:07
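Following up on that comment: with multiprocessing you can release workers explicitly and have the pool recycle them periodically, which returns whatever memory a worker has accumulated. This is only a sketch of that idea, reusing the `process_and_save` function and `all_doc_ids` list from the question; the `maxtasksperchild` value is an arbitrary example.

from multiprocessing import Pool

if __name__ == '__main__':
    # maxtasksperchild makes the pool restart each worker process after it has
    # handled the given number of documents, so any memory a worker accumulated
    # is returned to the OS when that worker exits.
    with Pool(processes=10, maxtasksperchild=100) as pool:
        pool.map(process_and_save, all_doc_ids)
    # the with-block terminates the pool on exit; pool.close() followed by
    # pool.join() can be used instead to shut the workers down explicitly.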

2 Answers

1

It seems this memory leak issue has been resolved in the latest version of joblib.

They introduced the loky backend as a safeguard against memory leaks.

Parallel(n_jobs=10, backend='loky')(delayed(process_and_save)(doc_id) for doc_id in all_doc_ids)

source: Memory Release after parallel
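For context, here is a minimal sketch of how this fits into the question's code, again assuming the `process_and_save` function and `all_doc_ids` list from the question. The `parallel_backend` context manager is just an alternative way of selecting loky; in recent joblib releases loky is already the default backend for `Parallel`.

from joblib import Parallel, delayed, parallel_backend

# select the loky backend explicitly for every Parallel call in this block
with parallel_backend('loky', n_jobs=10):
    Parallel()(delayed(process_and_save)(doc_id) for doc_id in all_doc_ids)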

chris
0

When you process all documents in parallel, each worker loads its whole file into memory, because read() creates a single string containing the entire file.

As a workaround you can read and process the files in chunks. See Lazy Method for Reading Big File in Python?
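A rough sketch of the chunked approach this answer suggests, reusing `path`, `other_path` and `process_text` from the question; `read_in_chunks` is a hypothetical helper modelled on the linked question, and the chunk size is an arbitrary example. Note that word-level normalization may need extra care at chunk boundaries, since a word can be split across two chunks.

def read_in_chunks(file_obj, chunk_size=1024 * 1024):
    # lazily yield the file one chunk at a time instead of reading it whole
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def process_and_save(document_id):
    # process and write chunk by chunk, so only one chunk is held in memory at a time
    with open(path + document_id) as src, open(other_path + document_id, 'w') as dst:
        for chunk in read_in_chunks(src):
            dst.write(process_text(chunk))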

staticdev