I'm writing a program that loads a huge data set (about 4.5G of RAM usage) and writes the data from memory to files. The program then writes the same set of data, but in different sequences, to different files.
For example:
data: 1, 2, 3, 4 (in my program the data is stored in a dictionary and is much larger)
outfileA: 1, 2, 4, 3
outfileB: 4, 2, 2, 1
...
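To make it concrete, here is a rough single-process sketch of what I mean by writing the same in-memory data in different orders (the data values, file names, and index orders are made up for illustration):

data = [1, 2, 3, 4]

# Each output file gets the same data, just in a different index order.
orders = {
    'outfileA': [0, 1, 3, 2],   # writes 1, 2, 4, 3
    'outfileB': [3, 1, 1, 0],   # writes 4, 2, 2, 1
}

for filename, order in orders.items():
    with open(filename, 'w') as outfile:
        for idx in order:
            outfile.write(str(data[idx]) + '\n')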
I tried to use multiprocessing to speed up the process and to resample the index of the array while writing to file. However, each subprocess takes additional memory: the data set takes 4.5G of RAM, and each subprocess takes an additional 4G of RAM while it is running. The weird thing is that when I don't use multiprocessing, the writing doesn't use any additional memory at all. I'm running my code on macOS. Below is a simple example that illustrates my question (it doesn't include the resample part):
import multiprocessing as mp

# Build the in-memory data set (much larger in my real program).
di = {}
l1 = ['a', 'b', 'c', 'd']
l2 = ['b', 'b', 'c', 'e']
for i in range(0, 2):
    name = "abc" + str(i)
    di.update({name: [l1, l2]})

def writ(dirname):
    # Only reads the shared dictionary and writes it out to a file.
    with open(dirname, 'w') as outfile:
        for key in di:
            for item in di[key][0]:
                outfile.write(item)

if __name__ == '__main__':
    p1 = mp.Process(target=writ, args=('d1',))
    p2 = mp.Process(target=writ, args=('d2',))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
In my real program, the data dictionary takes 4.5G of RAM, and each subprocess takes an additional 4G of RAM while running. That is not the case when I skip multiprocessing and just call the function directly: then no additional memory is used. This confuses me, because the subprocess only reads data that is already in memory and writes it to a file, which shouldn't require additional memory. I think the same extra memory usage also occurs with the sample code above.
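To check the numbers, something like the following could be used to print each process's resident memory, assuming the third-party psutil package is installed (report_rss is just a hypothetical helper name):

import os
import psutil  # third-party: pip install psutil

def report_rss(label):
    # Print the resident set size (RSS) of the calling process in MB.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(label, round(rss / (1024 ** 2), 1), 'MB')

Calling report_rss('parent') after building the dictionary and report_rss('child') inside writ shows how much memory each process actually holds.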