
I have a Python script which needs to run a loop about 50000 times. I defined a separate function to perform the data processing, so that its local variables are released and the memory freed once it returns. However, although the total size of all objects in my script is less than about 300 MB, the memory consumption reported by “top” grows to several GB after only a few hundred iterations. A sample script is given below:

from random import randint
from copy import deepcopy

class dataHolder(object):
    """
    This is a sample class for holding data.
    The actual class in my script holds lots of
    different data types.
    """
    def __init__(self, x, y):
        self.x = x
        self.y = y

def sub_process(data, x_change, y_change):
    # This function processes the input arguments and writes the results
    # into a text file. The following operations are just a sample of what I have in my actual script.
    # create a random index
    x_ind = randint(1,len(data.x)) - 1
    y_ind = randint(1,len(data.y)) - 1

    data.x[x_ind] += x_change
    data.y[y_ind] += y_change

    # Write the results into file
    with open('test.txt','w') as f:
        f.write('x[%i] = %i   y[%i] = %i\n' % (x_ind,data.x[x_ind],y_ind,data.y[y_ind]))

def master_func(start_pos):
    # This is an example of the main function in my actual script, 
    # which creates data used as input arguments
    # for function sub_process(). 
    # The approximate size of the input data 
    # for sub_process() is 320 Mb in my actual script, 
    # so in the following example I create data objects with 
    # approximately the same size. The following data are constant
    # for all iterations
    mydata = dataHolder(x = range(5000000), y = range(start_pos,5000000 + start_pos))

    # This is the main loop that must be run 50000 times
    for k in range(50000):
        # x_change and y_change vary from one iteration to another
        x_change = randint(1,10)
        y_change = randint(1,10)

        # Perform the data processing in a separate function
        sub_process(data = deepcopy(mydata), x_change = x_change, y_change = y_change)

if __name__ == "__main__":
    master_func(start_pos = 2)

Following the suggestion given here, I am trying to use sub-processes to resolve the memory issue, but I am not quite sure how to put it into the context of this particular problem. Any suggestions are greatly appreciated.

EDIT: The following modification of the sample code above resolved the memory issue:

from random import randint
from copy import deepcopy
from multiprocessing import Process

class dataHolder(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

def sub_process(input_data):
    data = deepcopy(input_data['data'])
    x_change = input_data['x_change']
    y_change = input_data['y_change']

    x_ind = randint(1,len(data.x)) - 1
    y_ind = randint(1,len(data.y)) - 1

    data.x[x_ind] += x_change
    data.y[y_ind] += y_change
    with open('test.txt','w') as f:
        f.write('x[%i] = %i   y[%i] = %i\n' % (x_ind,data.x[x_ind],y_ind,data.y[y_ind]))

def master_func(start_pos):
    input_data = {}
    input_data['data'] = dataHolder(x = range(5000000), y = range(start_pos,5000000 + start_pos)) 

    # This is the main loop that must be run 50000 times
    for k in range(50000):            
        input_data['x_change'] = randint(1,10)
        input_data['y_change'] = randint(1,10)

        p = Process(target = sub_process, args = (input_data,))
        p.start()
        p.join() 

if __name__ == "__main__":
    master_func(start_pos = 2)
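
For reference, another way to get the same per-task memory release without starting a Process by hand for every iteration would be a multiprocessing.Pool with maxtasksperchild. The sketch below is only an illustration of that idea, not code from my actual script: it assumes the constant data can live at module level, so that on Unix the forked workers inherit it without any pickling (on platforms that spawn rather than fork, each worker would rebuild it), and only the small per-iteration values are passed to the worker.

from random import randint
from multiprocessing import Pool

class dataHolder(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Built once at module level; on Unix the forked pool workers inherit it,
# so it is never pickled or copied per task (start_pos = 2 hard-coded here).
mydata = dataHolder(x = range(5000000), y = range(2, 5000002))

def sub_process(changes):
    x_change, y_change = changes
    x_ind = randint(1, len(mydata.x)) - 1
    y_ind = randint(1, len(mydata.y)) - 1
    # Mutating mydata here would not be visible in the parent process anyway,
    # so the changed values are simply written out.
    with open('test.txt', 'w') as f:
        f.write('x[%i] = %i   y[%i] = %i\n' % (x_ind, mydata.x[x_ind] + x_change,
                                               y_ind, mydata.y[y_ind] + y_change))

def master_func():
    # With maxtasksperchild = 1 and chunksize = 1, each worker is retired after
    # a single task, so any memory a task allocates is returned to the OS.
    pool = Pool(processes = 1, maxtasksperchild = 1)
    changes = [(randint(1, 10), randint(1, 10)) for k in range(50000)]
    pool.map(sub_process, changes, chunksize = 1)
    pool.close()
    pool.join()

if __name__ == "__main__":
    master_func()

Raising maxtasksperchild would let each worker handle several iterations before being replaced, trading a little memory headroom for fewer process restarts.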
  • The suggestion you think you're following meant to use the [`subprocess`](https://docs.python.org/2/library/subprocess.html#module-subprocess) **module** to do the processing, not just create a function and call like you're doing. – martineau Jan 15 '15 at 22:27
  • One issue you're going to have to solve to use the `subprocess` module is figuring out a way to pass all that input data to it. In your sample code it looks like this is essentially a constant so doing so may not even be necessary -- by which I mean your subprocess script can just create it for itself or read it in from a file. – martineau Jan 15 '15 at 22:34
  • Creating the input data takes some time and having it in the sub_process can slow down the iterations significantly. Reading data from a file is not possible because dataHolder in my actual script contains some methods and is not pickle-able. Furthermore, since these input data are constant for all iterations, it doesn't seem to be reasonable to make them every time that the loop runs. In my actual script the sub_process takes some input data that are constant (like mydata) and some others which change from one iteration to another. I'll modify the example code to correctly show this. – user3076813 Jan 16 '15 at 00:07
  • So, I changed the example code as follows to use subprocessing, but not sure if this is the correct way of addressing the memory issue: – user3076813 Jan 16 '15 at 00:24
  • You're still not using the `subprocess` module, so what you're doing likely won't prevent the one process that is running from consuming more and more memory. The whole point of using a separate process is that when it finishes all memory it consumed is freed. The problem with you using that approach is that you want to share data between it and the `master_func()` but they each will execute in their own address spaces so the problem of data sharing is raised. – martineau Jan 16 '15 at 00:42
  • The key to solving this dilemma will likely be figuring-out how to divide the problem up into two (or more) tasks which are as independent from one another as possible, thereby reducing the need to pass large amounts of data between or among them. – martineau Jan 16 '15 at 00:49
  • Sorry for the confusion. I did use multiprocessing in the modified code and meant to post it as part of my previous comment, but it seems that it hasn't appeared there. So, I added an EDIT to the question and included the modified code there. – user3076813 Jan 16 '15 at 02:43
  • Sorry, your edit just confuses me more...so I'm giving up. – martineau Jan 16 '15 at 06:43
  • In the edited code I just followed the simple example given [here](http://stackoverflow.com/questions/23937189/how-do-i-use-subprocesses-to-force-python-to-release-memory/24126616#24126616) as a guide. All I did was modify master_func() so it uses a process, and adjust sub_process() accordingly. Am I still completely off with using subprocesses? I am very naive with Python, so any suggestions on improving the question are welcome! – user3076813 Jan 16 '15 at 14:43
  • If the version using the `multiprocessing` module works and avoids the memory consumption issue, I think you're done. You could ask a new question about whether there's any better method for passing real data (which won't be the same every time like it is in your example) from the `master_func()` process to `sub_process()`. – martineau Jan 16 '15 at 17:36
  • have you tried [gc.collect](https://docs.python.org/3/library/gc.html#gc.collect) inside the loop? – maxy Jan 20 '15 at 18:00
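
For what it's worth, maxy's gc.collect suggestion would amount to something like the following in the main loop of the first version (sub_process and mydata as defined there); this is only a sketch, and whether it helps depends on whether the growing memory is actually held by uncollected reference cycles, which I haven't established:

import gc
from copy import deepcopy
from random import randint

for k in range(50000):
    x_change = randint(1, 10)
    y_change = randint(1, 10)
    sub_process(data = deepcopy(mydata), x_change = x_change, y_change = y_change)
    gc.collect()  # force collection of any cyclic garbage left by this iteration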

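For completeness, my understanding of the `subprocess`-module approach martineau describes in the comments above is sketched here: a separate worker script rebuilds the constant data itself and receives only the small per-iteration values on its command line, so nothing large has to be passed between processes. The file name worker.py is my own placeholder, and as discussed in the comments, recreating the data on every call would slow the iterations down considerably.

# worker.py (placeholder name): runs as its own OS process on every call,
# so all memory it allocates is released when it exits.
import sys
from random import randint

class dataHolder(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

def main():
    x_change = int(sys.argv[1])
    y_change = int(sys.argv[2])
    # The constant data are recreated here rather than passed in,
    # since dataHolder is not pickle-able in my actual script.
    data = dataHolder(x = range(5000000), y = range(2, 5000002))
    x_ind = randint(1, len(data.x)) - 1
    y_ind = randint(1, len(data.y)) - 1
    data.x[x_ind] += x_change
    data.y[y_ind] += y_change
    with open('test.txt', 'w') as f:
        f.write('x[%i] = %i   y[%i] = %i\n' % (x_ind, data.x[x_ind], y_ind, data.y[y_ind]))

if __name__ == "__main__":
    main()

and the calling side, which passes only the two small integers:

# Launch worker.py once per iteration and wait for it to finish.
import subprocess
from random import randint

for k in range(50000):
    subprocess.check_call(['python', 'worker.py',
                           str(randint(1, 10)), str(randint(1, 10))])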