
I'm working on a little IRC client in Python 2.7. I had hoped to use multiprocessing to read from all of the servers I'm currently connected to, but I'm running into an issue:

import socket
import multiprocessing as mp
import types
import copy_reg
import pickle


def _pickle_method(method):
    # Reduce a bound method to (function name, instance, class) so the
    # standard pickler can handle it.
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    # Walk the MRO to find the function, then re-bind it to the instance.
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)

# Teach pickle to serialize bound methods via the helpers above.
copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

class a(object):

    def __init__(self):
        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.connect((socket.gethostbyname("example.com"), 6667))
        self.servers = {}
        self.servers["example.com"] = sock1

    def method(self, hostname):
        self.servers[hostname].send("JOIN DAN\r\n")
        print "1"

    def oth_method(self):
        pool = mp.Pool()
        ## pickle.dumps(self.method)
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()

if __name__ == "__main__":
    b = a()
    b.oth_method()

Whenever it hits the line `pool.map(self.method, self.servers.keys())` I get the error:

TypeError: expected string or Unicode object, NoneType found

From what I've read, this is what happens when I try to pickle something that isn't picklable. To resolve it, I first added the `_pickle_method` and `_unpickle_method` functions described here. Then I realized that I was (originally) trying to pass `pool.map()` a list of sockets (very much not picklable), so I changed it to the list of hostnames, since strings can be pickled. I still get the same error, however.

I then tried calling `pickle.dumps()` directly on `self.method`, `self.servers.keys()`, and `self.servers.keys()[0]`. As expected, it worked fine for the latter two, but for the first I get:

TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled.

Some more research led me to this question, which seems to indicate that the issue is with the use of sockets (and gnibbler's answer to that question would seem to confirm it).

Is there a way that I can actually use multiprocessing for this? From what I've (very briefly) read, `pathos.multiprocessing` might be what I need, but I'd really like to stick to the standard library if at all possible.

I'm also not set on using multiprocessing; if multithreading would work better and avoid this issue, I'm more than open to those solutions.

Dan Oberlam
  • Are you actually trying to pass a socket to the child process, or is that just something that's happening accidentally that you're trying to avoid? For the former, you need to migrate sockets, which has to be done at a lower level than Python pickling, and it's different for each platform, because under the covers a socket is just a wrapper around a file descriptor, and you need the OS to make the same file descriptor mean the same socket in your child process. – abarnert Sep 29 '14 at 08:20
  • Meanwhile, is there a reason you're using multiprocessing instead of multithreading in the first place? "Reading from a bunch of servers" is about as close as you can get to a paradigm case for I/O-bound, which is exactly what threads are good for. – abarnert Sep 29 '14 at 08:21
  • No, I'm passing the child process the string key to the dictionary that refers to the socket. The child process then uses the string key to access the socket, do sockety stuff, then return. The reason I'm using multiprocessing instead of multithreading is because I'm new to multianything and I read that threading is slow in python. That being said I'm very open to multithreading solutions – Dan Oberlam Sep 29 '14 at 08:22
  • 1
    First, "threading is slow in Python" is not true. Threading is slow in Python _if you have CPU-bound code_, because only one thread can execute instructions at the same time. If your threads spend almost all of their time waiting on a socket recv or similar, there is no problem with threads, and processes just add overhead and complexity for no benefit. – abarnert Sep 29 '14 at 08:25
  • 1
    Second, "I'm passing the child process the string key to the dictionary that refers to the socket". So… how does it get that dictionary? Is this Unix-specific code that depends on the child inheriting the parent's state at startup? If so, then why do you think the problem has anything to do with pickling sockets? If not, then why do you think socket migration isn't necessary? – abarnert Sep 29 '14 at 08:27
  • 2
    Meanwhile, the answer to the question you linked says that using protocol -1 instead of the default will solve the problem. Have you tried that? If so, what happened? – abarnert Sep 29 '14 at 08:28
  • This is on Windows. And the dictionary is an instance attribute, so I assumed that the child process would also have access to them. I thought the problem might have to do with pickling sockets as they're the only thing that jumps out to me as non-picklable, and from what I've read this is caused by trying to pickle something that can't be. To be honest I have no idea what socket migration is - it might be necessary, it might not. I didn't try using protocol -1 because I didn't think you could do that inside of pool.map(), making it irrelevant if that worked – Dan Oberlam Sep 29 '14 at 08:32
  • Tried using protocol -1. Got the following. `pickle.PicklingError: Can't pickle : it's not found as __main__.recvfrom_into` – Dan Oberlam Sep 29 '14 at 08:36
  • OK, that's a whole other problem. But the bigger problem is that you are actually trying to pass sockets around, and you can't do that. Let me write an answer. – abarnert Sep 29 '14 at 08:38

2 Answers


Your root problem is that you can't pass sockets to child processes. The easy solution is to use threads instead.

In more detail:


Pickling a bound method requires pickling three things: the function name, the object, and the class. (I think multiprocessing does this for you automatically, but you're doing it manually; that's fine.) To pickle the object, you have to pickle its members, which in your case includes a dict whose values are sockets.

You can't pickle sockets with the default pickling protocol in Python 2.x. The answer to the question you linked explains why, and provides the simple workaround: don't use the default pickling protocol. But there's an additional problem with socket: it's just a wrapper around a type defined in a C extension module, and that type has its own problems with pickling. You might be able to work around that as well…
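
To illustrate the linked workaround, here's a minimal sketch of passing an explicit protocol to pickle (the Point class is just a picklable stand-in):

import pickle

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

# Protocol -1 means "highest protocol available" (protocol 2 on Python
# 2.x), rather than the old ASCII protocol 0 that pickle defaults to.
data = pickle.dumps(Point(1, 2), -1)
print pickle.loads(data).x  # prints 1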

But that still isn't going to help. Under the covers, that C extension class is itself just a wrapper around a file descriptor. A file descriptor is just a number. Your operating system keeps a mapping of file descriptors to open sockets (and files and pipes and so on) for each process; file #4 in one process isn't file #4 in another process. So, you need to actually migrate the socket's file descriptor to the child at the OS level. This is not a simple thing to do, and it's different on every platform. And, of course, on top of migrating the file descriptor, you'll also have to pass enough information to reconstruct the socket object. All of this is doable; there might even be a library that wraps it up for you. But it's not easy.


One alternate possibility is to open all of the sockets before launching any of the children, and set them to be inherited by the children. But, even if you could redesign your code to do things that way, this only works on POSIX systems, not on Windows.
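
Here's a minimal sketch of that approach, assuming a POSIX system where multiprocessing forks and the children inherit the parent's open file descriptors (as noted above, this will not work on Windows):

import socket
import multiprocessing as mp

SERVERS = {}  # hostname -> socket; filled in before any child starts

def worker(hostname):
    # On POSIX, each forked child inherits SERVERS along with the open
    # file descriptors, so only the hostname string crosses the
    # process boundary.
    SERVERS[hostname].send("JOIN DAN\r\n")

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((socket.gethostbyname("example.com"), 6667))
    SERVERS["example.com"] = sock
    procs = [mp.Process(target=worker, args=(h,)) for h in SERVERS]
    for p in procs:
        p.start()
    for p in procs:
        p.join()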


A much simpler possibility is to just use threads instead of processes. If you're doing CPU-bound work, threads have problems in Python (well, CPython, the implementation you're almost certainly using) because there's a global interpreter lock that prevents two threads from interpreting code at the same time. But when your threads spend all their time waiting on socket.recv and similar I/O calls, there is no problem using threads. And they avoid all the overhead and complexity of pickling data and migrating sockets and so forth.

You may notice that the threading module doesn't have a nice Pool class like multiprocessing does. Surprisingly, however, there is a thread pool class in the stdlib—it's just in multiprocessing. You can access it as multiprocessing.dummy.Pool.
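
For your code that's essentially a one-line change. A sketch, with the constructor simplified to take the servers dict directly:

import multiprocessing.dummy as mp  # thread workers behind the Pool API

class a(object):
    def __init__(self, servers):
        self.servers = servers  # hostname -> socket, as in your code

    def method(self, hostname):
        self.servers[hostname].send("JOIN DAN\r\n")

    def oth_method(self):
        # Same interface as multiprocessing.Pool, but the workers are
        # threads in this process, so self.method is never pickled and
        # the sockets are shared naturally.
        pool = mp.Pool()
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()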

If you're willing to go beyond the stdlib, the concurrent.futures module from Python 3 has a backport named futures that you can install from PyPI. It includes a ThreadPoolExecutor, a slightly higher-level abstraction around a pool that may be simpler to use. But Pool should also work fine for you here, and you've already written the code.
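
A sketch using the backport (pip install futures; it's imported under the Python 3 name, and the fetch function here is just a placeholder for real per-server work):

from concurrent.futures import ThreadPoolExecutor

def fetch(hostname):
    # stand-in for "send JOIN, read replies" against a real socket
    return hostname.upper()

with ThreadPoolExecutor(max_workers=4) as executor:  # four worker threads
    # executor.map mirrors pool.map; wrapping it in list() forces the
    # calls to finish and re-raises any worker exception here.
    results = list(executor.map(fetch, ["example.com", "example.org"]))
print results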

abarnert

If you do want to try jumping out of the standard library, then the following code for pathos.multiprocessing (as you mention) should not throw pickling errors, as the dill serializer knows how to serialize sockets and file handles.

>>> import socket
>>> import pathos.multiprocessing as mp
>>> import types
>>> import dill as pickle
>>>
>>> class a(object):
...    def __init__(self):
...        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
...        sock1.connect((socket.gethostbyname("example.com"), 6667))
...        self.servers = {}
...        self.servers["example.com"] = sock1
...    def method(self, hostname):
...        self.servers[hostname].send("JOIN DAN\r\n")
...        print "1"
...    def oth_method(self):
...        pool = mp.ProcessingPool()
...        pool.map(self.method, self.servers.keys())
...        pool.close()
...        pool.join()
...
>>> b = a()
>>> b.oth_method()

One issue, however, is that multiprocessing requires serialization, and in many cases a socket will serialize in such a way that the deserialized socket is closed. The reason is primarily that the file descriptor isn't copied as you might expect; it's copied by reference. With dill you can customize the serialization of file handles so that the content gets transferred rather than a reference… however, this doesn't translate well to a socket (at least at the moment).
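
For regular files (not sockets), that customization looks roughly like this. A sketch, assuming a dill version where the fmode argument is available:

import dill

f = open("log.txt", "w")
f.write("hello")
f.flush()

# dill.FILE_FMODE pickles the handle together with the file's
# contents, so the deserialized object carries the data with it; the
# default (dill.HANDLE_FMODE) pickles a reference to the file instead.
data = dill.dumps(f, fmode=dill.FILE_FMODE)
g = dill.loads(data)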

I'm the dill and pathos author, and I'd have to agree with @abarnert that you probably don't want to do this with multiprocessing (at least not by storing a map of servers and sockets). If you want multiprocessing's threading interface, and you find you run into any serialization concerns, pathos.multiprocessing provides mp.ThreadingPool() in place of mp.ProcessingPool(). It wraps multiprocessing.dummy.Pool, but still gives you the additional features that pathos provides (such as multi-argument pools for blocking or asynchronous pipes and maps, etc.).
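
A sketch of the thread-based variant, in the same spirit as the code above (the hostnames and ports are illustrative, and the lambda stands in for real per-server work):

>>> import pathos.multiprocessing as mp
>>> pool = mp.ThreadingPool()
>>> # pathos maps take one iterable per function argument
>>> pool.map(lambda host, port: "%s:%d" % (host, port),
...          ["example.com", "example.org"], [6667, 6667])
['example.com:6667', 'example.org:6667']
>>> pool.close()
>>> pool.join()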

Mike McKerns
  • I'm curious how you're serializing file handles. It's not like `unix_sock.sendmsg(sock)` and `sock.share(pid)` are exactly hard to write, but wrapping them up in a process pool interface seems like a bit of a design headache. (Also, dealing with the fact that sockets and files are different things on Windows…) But that's probably off-topic to continue here, so I'll go download your module and read it instead. :) – abarnert Sep 29 '14 at 19:44
  • @abarnert: there's nothing smart done for sockets yet. There are several options (recently added on https://github.com/uqfoundation/dill and not in the current release) for dealing with serializing files… and it does need testing on Windows. I'm aware of the differences between Windows and any other OS. The fork of `multiprocessing` is pretty simple: I just replace the serializer and add a small convenience layer on top to, for example, allow `map` to take multiple arguments. – Mike McKerns Sep 29 '14 at 22:42