I need, in a data analysis python project, to use both classes and multiprocessing features, and I haven't found a good example of it on Google.
My basic idea - which is probably wrong - is to create a class with a high-size variable (it's a pandas dataframe in my case), and then to define a method which computes an operation (a sum in this case).
import multiprocessing
import time
class C:
def __init__(self):
self.__data = list(range(0, 10**7))
def func(self, nums):
return sum(nums)
def start_multi(self):
for n_procs in range(1, 4):
print()
time_start = time.clock()
chunks = [self.__data[(i-1)*len(self.__data)// n_procs: (i)*len(self.__data)// n_procs] for i in range(1, n_procs+1)]
pool = multiprocessing.Pool(processes=n_procs)
results = pool.map_async(self.func, chunks )
results.wait()
pool.close()
results = results.get()
print(sum(results))
print("n_procs", n_procs, "total time: ", time.clock() - time_start)
print('sum(list(range(0, 10**7)))', sum(list(range(0, 10**7))))
c = C()
c.start_multi()
The code doesn't work properly: I get the following print output
sum(list(range(0, 10**7))) 49999995000000
49999995000000
n_procs 1 total time: 0.45133500000000026
49999995000000
n_procs 2 total time: 0.8055279999999954
49999995000000
n_procs 3 total time: 1.1330870000000033
that is the computation time increases instead of decreasing. So, which is the error in this code?
But I'm also worried by the RAM usage since, when the variable chunks is created, the self.__data RAM usage is doubled. Is it possible, when dealing with multiprocessing code, and more specifically in this code, to avoid this memory waste? (I promise I'll put everything on Spark in the future :) )