
I'm writing a program that loads a huge data set (about 4.5 GB of RAM usage) and writes the data from memory to files. The program then writes the same data, but in different sequences, to different files.

For example:

data: 1, 2, 3, 4 (in my program, the data is stored in a dictionary and is much larger)

outfileA: 1, 2, 4, 3
outfileB: 4, 2, 2, 1
...

I tried to use multiprocessing to speed up the process, resampling the index of the array when writing to file. However, each subprocess takes additional memory. For example, the data set takes 4.5 GB of RAM, and each subprocess takes an additional 4 GB of RAM while running. The strange thing is that when I don't use multiprocessing, the writing doesn't use any additional memory at all. Below is a simple example that illustrates my question (it doesn't include the resampling part):

I'm running my code on macOS

import multiprocessing as mp

# A small stand-in for the real dictionary, which takes ~4.5 GB of RAM.
di = {}
l1 = ['a', 'b', 'c', 'd']
l2 = ['b', 'b', 'c', 'e']

for i in range(2):
    name = "abc" + str(i)
    di[name] = [l1, l2]

def writ(filename):
    # Each child process reads the inherited global dictionary and
    # writes part of it to its own output file.
    with open(filename, 'w') as outfile:
        for key in di:
            for item in di[key][0]:
                outfile.write(item)

p1 = mp.Process(target=writ, args=('d1',))
p2 = mp.Process(target=writ, args=('d2',))

p1.start()
p2.start()
p1.join()
p2.join()

In my real program, the data dictionary takes 4.5 GB of RAM and each subprocess takes an additional 4 GB of RAM while running. That is not the case when I don't use multiprocessing and simply call the function: then no additional memory is used. This confuses me, because the function only reads data that is already in memory and writes it to files, which shouldn't require extra memory. I think the same additional memory usage also happens with the sample code above.
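
For reference, here is a minimal sketch of how the per-process memory could be compared while the children are running. It assumes the third-party psutil package, which is not part of my original program, and uses a tiny dictionary in place of the real one:

import os
import time
import multiprocessing as mp

import psutil  # third-party package: pip install psutil

# Tiny stand-in for the real ~4.5 GB dictionary.
di = {"abc" + str(i): [['a', 'b', 'c', 'd'], ['b', 'b', 'c', 'e']] for i in range(2)}

def writ(filename):
    with open(filename, 'w') as outfile:
        for key in di:
            for item in di[key][0]:
                outfile.write(item)
    time.sleep(5)  # keep the child alive long enough to be measured

def rss_mb(pid):
    # Resident set size of the given process, in MB.
    return psutil.Process(pid).memory_info().rss / (1024 * 1024)

if __name__ == '__main__':
    p1 = mp.Process(target=writ, args=('d1',))
    p2 = mp.Process(target=writ, args=('d2',))
    p1.start()
    p2.start()
    time.sleep(1)  # give the children a moment to start working
    print("parent :", rss_mb(os.getpid()), "MB")
    print("child 1:", rss_mb(p1.pid), "MB")
    print("child 2:", rss_mb(p2.pid), "MB")
    p1.join()
    p2.join()

With the real data, the three numbers would show whether each child really holds its own copy of the dictionary.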

  • `multiprocessing` uses *multiple Python processes*. It creates *copies* of the data from the parent process. Since you are running an I/O-bound task, this would actually be a good use case for the `threading` module instead, although your resampling might complicate this. – juanpa.arrivillaga Jul 16 '18 at 04:46
  • @juanpa.arrivillaga I read some other posts and realized that macOS wouldn't copy the data from the parent process since it uses copy-on-write. Please correct me if I'm wrong – Unsalted Fish Jul 16 '18 at 04:55
  • Please show us your sources! – Klaus D. Jul 16 '18 at 05:05
  • @KlausD. This is where I read about the COW behavior, and I also updated the code. https://stackoverflow.com/questions/14749897/python-multiprocessing-memory-usage – Unsalted Fish Jul 16 '18 at 05:18
  • Every time you reference a Python object you potentially increase its reference count, so COW will rarely benefit you if you use Python, unless you work with some low-level libraries and take special care. Indeed, this very thing is discussed in the link you provided. Your `dict` objects will definitely be copied here, and that is *expected* – juanpa.arrivillaga Jul 16 '18 at 06:07
  • COW means that the memory won't actually be copied unless needed. It is still allocated for the process, though. Now, if you don't actually run out of memory to the point of failing... and you don't actually change the large data you read in... you might be just fine, as long as you don't let it bother you that your program is reported as using that memory (and nothing else needs it). – Ondrej K. Jul 16 '18 at 11:00
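
Following the threading suggestion in the first comment: since threads share the parent process's memory, the dictionary is not duplicated. A minimal sketch of the same example using threads instead of processes (the resampling step from the question is not included, and the GIL may limit any CPU-bound part):

import threading

di = {}
l1 = ['a', 'b', 'c', 'd']
l2 = ['b', 'b', 'c', 'e']

for i in range(2):
    di["abc" + str(i)] = [l1, l2]

def writ(filename):
    # Threads run in the same process, so di is shared, not copied.
    with open(filename, 'w') as outfile:
        for key in di:
            for item in di[key][0]:
                outfile.write(item)

t1 = threading.Thread(target=writ, args=('d1',))
t2 = threading.Thread(target=writ, args=('d2',))

t1.start()
t2.start()
t1.join()
t2.join()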

0 Answers