
I'm trying to use multiprocessing to get a handle on my memory issues; however, I can't get a function to pickle, and I have no idea why. My main code starts with:

import logging
from multiprocessing import Process, Queue

def main():
    print "starting main"
    q = Queue()
    p = Process(target=file_unpacking, args=("hello world", q))
    p.start()
    p.join()
    if p.is_alive():
        p.terminate()
    print "The results are in"
    Chan1 = q.get()
    Chan2 = q.get()
    Start_Header = q.get()
    Date = q.get()
    Time = q.get()
    return Chan1, Chan2, Start_Header, Date, Time

def file_unpacking(args, q):
    print "starting unpacking"
    fileName1 = "050913-00012"
    # the accumulator lists must exist before the loop first appends to them
    Data_Sets1, Data_Sets2, Headers = [], [], []
    temp_1, temp_2, temp_3 = [], [], []
    unpacker = UnpackingClass()
    for fileNumber in range(0, 44):
        # fileName3 (the file extension) and path are defined elsewhere in the full script
        fileName = fileName1 + str(fileNumber) + fileName3
        header, data1, data2 = UnpackingClass.unpackFile(path, fileName)

        if header is None:
            logging.warning("corrupted file found at " + fileName)
            Data_Sets1.append(temp_1)
            Data_Sets2.append(temp_2)
            Headers.append(temp_3)
            temp_1 = []
            temp_2 = []
            temp_3 = []
            #for i in range(0,10000):
            #    Chan1.append(0)
            #    Chan2.append(0)

        else:
            logging.info(fileName + " is good!")
            temp_3.append(header)
            for i in range(0,10000):
                temp_1.append(data1[i])
                temp_2.append(data2[i])

    Data_Sets1.append(temp_1)
    Data_Sets2.append(temp_2)
    Headers.append(temp_3)
    temp_1 = []
    temp_2 = []
    temp_3 = []

    lengths = []
    for i in range(len(Data_Sets1)):
        lengths.append(len(Data_Sets1[i]))
    index = lengths.index(max(lengths))

    Chan1 = Data_Sets1[index]
    Chan2 = Data_Sets2[index]
    Start_Header = Headers[index]
    Date = Start_Header[index][0]
    Time = Start_Header[index][1]
    print "done unpacking"
    q.put(Chan1)
    q.put(Chan2)
    q.put(Start_Header)
    q.put(Date)
    q.put(Time)

Currently I have the unpacking method in a separate Python file that imports struct and os. It reads a part-text, part-binary file, structures the contents, and then closes it. This is mostly legwork, so I won't post it yet, but I will if it helps. Here is how it starts:

class UnpackingClass:
    def __init__(self):
        print "Unpacking Class"
    @staticmethod  # without this, calling unpackFile on the class itself raises an unbound-method error
    def unpackFile(path, fileName):
        import struct
        import os
    .......

Then I simply call main() to get the party started, and I get nothing but an infinite loop of pickle errors.

Long story short, I don't have any clue how to pickle a function. Everything is defined at the top level of my files, so I'm at a loss.

Here is the error message:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\multiprocessing\forking.py", line 373, in main
    prepare(preparation_data)
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\multiprocessing\forking.py", line 488, in prepare
    '__parents_main__', file, path_name, etc
  File "A:\598\TestCode\test1.py", line 142, in <module>
    Chan1, Chan2, Start_Header, Date, Time = main()
  File "A:\598\TestCode\test1.py", line 43, in main
    p.start()
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\multiprocessing\process.py", line 130, in start
    self._popen = Popen(self)
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\multiprocessing\forking.py", line 271, in __init__
    dump(process_obj, to_child, HIGHEST_PROTOCOL)
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\multiprocessing\forking.py", line 193, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\pickle.py", line 224, in dump
    self.save(obj)
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\pickle.py", line 419, in save_reduce
    save(state)
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\pickle.py", line 681, in _batch_setitems
    save(v)
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Users\Casey\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\pickle.py", line 748, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <function file_unpacking at 0x0000000007E1F048>: it's not found as __main__.file_unpacking
– Cate Daniel
  • Can you show what the pickling errors are? – mgilson Mar 15 '14 at 02:35
  • Updated with the error message; it loops every few seconds – Cate Daniel Mar 15 '14 at 02:58
  • The error seems straightforward to me: the function `__main__.file_unpacking` needs to be defined when you unpack. The `__main__` prefix means that `file_unpacking` should be defined at the top level in the main script. But, really, trying to pickle a function is a bad idea unless you really know what you're doing. – jrennie Mar 15 '14 at 18:48
  • @bobruels44 - I am curious about your statement `I'm trying to use multiprocessing to get a handle on my memory issues`. Can you elaborate? I do not follow why multiprocessing would consume less memory. – Roberto Mar 22 '14 at 18:10
  • I'm running a script that opens up data, processes it, and then plots it. I want it to use a few thousand data sets. After about 60 or so, it uses up around 14 GB of RAM. Someone in another question suggested threading it and letting the OS handle the memory issues. – Cate Daniel Mar 22 '14 at 23:57

2 Answers


Pickling a function is a very relevant thing to do if you want to do any parallel computing. Python's pickle and multiprocessing are pretty broken for parallel computing, so if you aren't averse to going outside the standard library, I'd suggest dill for serialization and pathos.multiprocessing as a multiprocessing replacement. dill can serialize almost anything in Python, and pathos.multiprocessing uses dill to provide more robust parallel CPU use. For more information, see:

What can multiprocessing and dill do together?

or this simple example:

Python 2.7.6 (default, Nov 12 2013, 13:26:39) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from pathos.multiprocessing import ProcessingPool
>>> 
>>> def squared(x):
...   return x**2
... 
>>> pool = ProcessingPool(4)
>>> pool.map(squared, range(10))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> res = pool.amap(squared, range(10))
>>> res.get()
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> res = pool.imap(squared, range(10))
>>> list(res)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> 
>>> def add(x,y):
...   return x+y
... 
>>> pool.map(add, range(10), range(10))
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> res = pool.amap(add, range(10), range(10))
>>> res.get()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> res = pool.imap(add, range(10), range(10))
>>> list(res)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Both dill and pathos are available here: https://github.com/uqfoundation
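
As for the literal question of pickling a function: dill serializes the function body itself rather than just a name reference, so even an interactively defined function round-trips cleanly. A quick sketch in the same session style:

>>> import dill
>>> def squared(x):
...   return x**2
... 
>>> s = dill.dumps(squared)   # dill captures the function body, not just its name
>>> dill.loads(s)(3)
9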

– Mike McKerns
  • Great job with dill and pathos. The only critique I have for you is that pathos is a huge dependency to have, including all of its own dependencies, if all you need is `pathos.multiprocessing`. – JoErNanO Oct 28 '14 at 17:25

You can technically pickle a function, but only a name reference is saved. When you unpickle, you must set up the environment so that the name reference makes sense to Python. Make sure to read What can be pickled and unpickled carefully.
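
You can see this directly in a 2.7 interpreter: the pickle of a top-level function is nothing but its module and name (a quick illustration, using the default protocol 0):

>>> import pickle
>>> def f():
...     pass
... 
>>> pickle.dumps(f)   # stores only the reference __main__.f, none of the code
'c__main__\nf\np0\n.'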

If this doesn't answer your question, you'll need to provide us with the exact error messages. Also, please explain the purpose of pickling a function. Since you can only pickle a name reference and not the function itself, why can't you simply import and call the corresponding code?
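
In this case the traceback in the question gives a strong hint: on Windows, multiprocessing spawns a fresh interpreter and re-imports the main module, so the module-level call to main() (test1.py line 142) runs again in every child, and the target must be importable as __main__.file_unpacking. The usual fix is to guard the entry point. Here is a minimal sketch of the pattern (the names mirror the question; the bodies are placeholders):

from multiprocessing import Process, Queue

def file_unpacking(args, q):
    # placeholder standing in for the real file-reading work
    q.put("unpacked " + args)

def main():
    q = Queue()
    p = Process(target=file_unpacking, args=("hello world", q))
    p.start()
    result = q.get()   # drain the queue before join(); joining first can deadlock on large results
    p.join()
    return result

if __name__ == '__main__':   # keeps the child's re-import from re-running main()
    print main()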

– jrennie
  • I think it should be; my entire Class/Module is one (albeit long) method. Should that not be at the top? I'm pickling to try to cut down on memory usage. This will be a part of a larger script to plot a bunch of data sets, and currently I have to do one day at a time because otherwise I burn through 16 GB of RAM – Cate Daniel Mar 15 '14 at 02:57