
I am trying to use the other cores in my Python program. The following is the basic structure/logic of my code:

import multiprocessing as mp
import pandas as pd
import gc

def multiprocess_RUN(args):
    # pool.map passes a single argument, so the (object, param)
    # pair is packed into one tuple and unpacked here
    obj, param = args
    return obj.run(param)

class Analysis_Obj():

    def __init__(self, filename):
        self.DF = pd.read_csv(filename)

    def run_Analysis(self, param):
        # Multi-core option
        pool = mp.Pool(processes=1)
        run_result = pool.map(multiprocess_RUN, [(self, param)])[0]

        # Normal option
        run_result = self.run(param)

        return run_result

    def run(self, param):

        # Let's say I have written a function to count the frequency of 'param' in the target file
        result = count(self.DF, param)
        return result

if __name__ == "__main__":
    files = ['file1.csv', 'file2.csv']
    params = [1,2,3,4]
    results = []

    for i in range(0,len(files)):
        analysis = Analysis_Obj(files[i])
        for j in range(0,len(params)):
            result = analysis.run_Analysis(params[j])
            results.append(result)
        del result
    del analysis
    gc.collect()

If I comment out the 'Multi-core option' and run the 'Normal option', everything runs fine. But even if I run the 'Multi-core option' with processes=1, I get a Memory Error when the for loop starts on the 2nd file. I have deliberately set it up so that I create and delete an Analysis_Obj in each iteration of the loop, so that the file that has been processed is cleared from memory. Clearly this hasn't worked. Advice on how to get around this would be very much appreciated.

Cheers

EDIT:

Here is the error message I have in the terminal:

Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 326, in _handle_workers
    pool._maintain_pool()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 230, in _maintain_pool
    self._repopulate_pool()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
    w.start()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
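[Editor's note: the traceback shows os.fork failing with ENOMEM, which is consistent with run_Analysis creating a new mp.Pool on every call and never closing it, so worker processes accumulate across iterations. A minimal sketch of the close/join pattern, using a plain list in place of the pandas DataFrame and hypothetical names (_run_one, Analysis, run_analysis):]

```python
import multiprocessing as mp

def _run_one(args):
    # Pool workers can only call module-level functions, so unpack
    # the (object, param) pair here (hypothetical helper)
    obj, param = args
    return obj.run(param)

class Analysis(object):
    def __init__(self, data):
        self.DF = data  # plain list standing in for the real DataFrame

    def run(self, param):
        # stand-in for count(): occurrences of param in the data
        return self.DF.count(param)

    def run_analysis(self, params):
        pool = mp.Pool(processes=2)
        try:
            return pool.map(_run_one, [(self, p) for p in params])
        finally:
            pool.close()  # stop accepting new tasks
            pool.join()   # reap the workers so they don't accumulate
```

Without the close()/join() (or a `with mp.Pool(...)` block on Python 3), each call leaves its worker processes alive, each holding a copy of the parent's memory, until fork can no longer allocate.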
Mark Tolonen
Jason Tam
  • What is the exact error you get? Can you post the stack trace? – pstatix Nov 10 '17 at 00:20
  • @pstatix I have just added it to the main post, thanks in advance – Jason Tam Nov 10 '17 at 00:38
  • There are a lot of errors here. The run method needs a self argument. The map call is being applied to self and then to param separately. The for j in range can be replaced with for p in params. I can't see where the run_Analysis method is called in the program. – geckos Nov 10 '17 at 00:38
  • multiprocess_RUN calls the run method as if it were a static method, but it isn't. The body of the run method refers to self when there is no self in the parameters. – geckos Nov 10 '17 at 00:42
  • My advice: make this work with one process. Isolate the logic in a function. Make it work with the built-in map function, then add the parallel map to it. – geckos Nov 10 '17 at 00:45
  • The code isn't valid (due to indenting). Also, a match between the **exception trace** and the **lines of code** would be more than helpful. We don't know what _analysis_ does. – CristiFati Nov 10 '17 at 00:50
  • Make a working example. Skip pandas and data files, and just process some lists or something that is a simple, tangible example. Cut-and-paste the exact code, use the `{}` format button to indent the whole thing properly, and post expected input and actual output including tracebacks. Make something that is easy for someone to debug and you'll get good answers instead of annoyed comments. – Mark Tolonen Nov 10 '17 at 01:02
  • @geckos Thanks for pointing out the mistakes. As for the way I have written multiprocess_RUN, I have taken ideas from here: http://www.rueckstiess.net/research/snippets/show/ca1d7d90 which seems to work perfectly fine to me. As for the suggestions involving parallel maps, do you mind showing an example? Cheers – Jason Tam Nov 10 '17 at 01:53
  • @MarkTolonen Thanks for the input. I would make an example that reproduces the error if I could, but performing the same logic (even keeping the dataframes involved) with simple lists still works fine. The several mistakes in the original code I posted have now been removed, and the indentation is where it needs to be as far as I can tell. – Jason Tam Nov 10 '17 at 01:59
  • Basically you are trying to parallelize the reading of files. I don't really know what read_csv does, but usually parallelization is applied to CPU-bound stuff. Even if it works you will end up with multiple processes waiting for the disk... Anyway, the multiprocessing docs have a parallel map example. – geckos Nov 10 '17 at 02:12
  • I saw the post you quoted. It seems that you can't call instance methods with parallel map, so the author isolates the parallel map call in a function. If you look closely you will see `[self]*len(x)`; this creates a list of `len(x)` selfs and zips it with the parameters. zip will return a list of pairs of its arguments. This is totally different from what you do. – geckos Nov 10 '17 at 02:19
  • @geckos Thanks for your efforts. I understand that part, and how mine is different from it. This is done on purpose: applying the exact same method still brings up the same memory issues. I have modified it with the intention that the multiprocessing is used one small step at a time, deleting everything that is no longer needed from memory immediately after, in an attempt to solve the issue. – Jason Tam Nov 10 '17 at 02:25
  • Take a look at this answer to see why you can't use parallel map on instance methods for python 2 https://stackoverflow.com/q/27318290/652528 – geckos Nov 10 '17 at 02:26
  • I guess that's a lot of forks going on, no? – geckos Nov 10 '17 at 02:37
  • The easiest way to indent correctly (it still wasn't) is to cut-n-paste your working script, highlight it, and use the `{}` formatting button to format as code. To make a working example we can reproduce, we need the sample data files as well. Where's the code for the `count` function? – Mark Tolonen Nov 10 '17 at 03:02
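[Editor's note: the `[self]*len(x)` trick discussed in the comments (pairing a repeated self with each parameter so a module-level function can dispatch to the instance method, since Python 2 cannot pickle bound methods) can be sketched as follows. All names here (unwrap_call, Counter, parallel_count) are hypothetical:]

```python
import multiprocessing as mp

def unwrap_call(args):
    # module-level trampoline: workers unpack (object, param) and
    # call the instance method themselves
    obj, x = args
    return obj.count_x(x)

class Counter(object):
    def __init__(self, data):
        self.data = data

    def count_x(self, x):
        return self.data.count(x)

    def parallel_count(self, xs):
        pool = mp.Pool(processes=2)
        try:
            # pair a copy of self with every parameter, as in the
            # snippet quoted in the comments
            return pool.map(unwrap_call, zip([self] * len(xs), xs))
        finally:
            pool.close()
            pool.join()
```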

0 Answers