I have several processes, each completing tasks that require a single large numpy array. The array is only ever read (the processes search it for appropriate values).

If each process loads the data I receive a memory error.

I am therefore trying to minimise the memory usage by using a Manager to share the same array between the processes.

However, I still receive a memory error. I can load the array once in the main process, but the moment I try to make it an attribute of the manager namespace I receive a memory error. I assumed that Managers acted like pointers and allowed separate processes (which normally only have access to their own memory) to access this shared memory as well. However, the error mentions pickling:

Traceback (most recent call last):
  File <PATH>, line 63, in <module>
    ns.pp = something
  File "C:\Program Files (x86)\Python35-32\lib\multiprocessing\managers.py", line 1021, in __setattr__
    return callmethod('__setattr__', (key, value))
  File "C:\Program Files (x86)\Python35-32\lib\multiprocessing\managers.py", line 716, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "C:\Program Files (x86)\Python35-32\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(ForkingPickler.dumps(obj))
  File "C:\Program Files (x86)\Python35-32\lib\multiprocessing\reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
MemoryError

I assume the numpy array is actually being copied when assigned to the manager, but I may be wrong.

To make matters a little more irritating, I am on a machine with 32GB of memory, and when I watch the memory usage it only increases a little before the crash, maybe by 5%-10% at most.

Could someone explain why making the array an attribute of the namespace takes up even more memory, and why my program won't use some of the spare memory available? (I have already read the namespace and manager docs, as well as these managers and namespace threads on SO.)

I am running Windows Server 2012 R2 and Python 3.5.2 32bit.
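
For reference, the interpreter's bitness can be confirmed with a quick check like the one below. This is a minimal sketch that only reports the pointer size of the running interpreter; it is not part of the failing program.

```python
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
pointer_bits = struct.calcsize("P") * 8

print("{}-bit Python {}".format(pointer_bits, sys.version.split()[0]))
```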

Here is some code demonstrating my problem (you will need to use an alternative file to large.txt, this file is ~75MB of tab delimited strings):

import multiprocessing
import numpy as np

if __name__ == '__main__':

    # load Price Paid Data and assign to manager
    mgr = multiprocessing.Manager()
    ns = mgr.Namespace()

    ns.data = np.genfromtxt('large.txt')
    # Alternative showing that this works for smaller objects
    # ns.data = 'Test PP data'
Harry de winton
  • See [**_Is shared readonly data copied to different processes for multiprocessing?_**](https://stackoverflow.com/questions/5549190/is-shared-readonly-data-copied-to-different-processes-for-multiprocessing). – martineau Aug 11 '17 at 15:05

1 Answer

Manager types are built for flexibility, not efficiency. They create a server process that holds the values and return proxy objects to each process that needs them. The server and proxy communicate over tls so that they can live on different machines, but this necessarily means copying whatever object is sent: the value is pickled on one side and rebuilt on the other, which is the `ForkingPickler.dumps` call where your traceback ends. I haven't traced the source all the way, so it's possible the extra copy is garbage collected after use, but at least initially there has to be a copy.
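
To get a feel for the size of that copy, here is a minimal sketch that pickles a block of numeric data and compares the payload with the raw data. It uses the standard library `array` module as a stand-in for the numpy array (an assumption made to keep the example dependency-free), and it only measures the serialized payload, not everything the manager does internally.

```python
import pickle
from array import array

# Stand-in for the large numpy array: one million 8-byte floats (about 8 MB).
data = array("d", [0.0]) * 10**6

# Pickling is how multiprocessing connections serialize objects before sending.
payload = pickle.dumps(data)

raw_bytes = data.itemsize * len(data)
print("raw data bytes:", raw_bytes)
print("pickled bytes :", len(payload))
```

The pickled payload is a second, full-size byte string that has to exist alongside the original before anything reaches the server process.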

If you want shared physical memory, I suggest using Shared ctypes Objects. These actually do point to a common location in memory, and are therefore much faster and lighter on resources. They do not support everything that full-fat Python objects do, but they can be extended by creating structs to organize your data.
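
As a rough sketch of what that could look like for a read-only numpy array: the helper names `to_shared`, `init_worker` and `count_above` are made up for illustration, the data is assumed to be one-dimensional `float64`, and a small `np.arange` array stands in for the real file. `RawArray` is used without a lock on the assumption that the workers only read.

```python
import ctypes
import multiprocessing as mp
from multiprocessing.sharedctypes import RawArray

import numpy as np


def to_shared(values):
    """Copy a 1-D float64 array into a lock-free shared ctypes buffer (done once)."""
    shared = RawArray(ctypes.c_double, values.size)
    np.frombuffer(shared, dtype=np.float64)[:] = values
    return shared


def init_worker(shared):
    """Runs once in each worker: wrap the shared buffer as a numpy view, no copy."""
    global view
    view = np.frombuffer(shared, dtype=np.float64)


def count_above(threshold):
    """Example read-only search over the shared data."""
    return int((view > threshold).sum())


if __name__ == "__main__":
    # Stand-in for np.genfromtxt('large.txt'); assumed 1-D float64 here.
    data = np.arange(1000, dtype=np.float64)
    shared = to_shared(data)
    with mp.Pool(2, initializer=init_worker, initargs=(shared,)) as pool:
        print(pool.map(count_above, [100.0, 500.0]))
```

The shared buffer is handed to each worker once, through the pool initializer, instead of being sent with every task. Note that filling the buffer still needs one copy of the data in the parent process.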

Aaron
  • you suggested `structs` but would a ctypes Array also be suitable? – Harry de winton Aug 11 '17 at 14:05
  • also what is tls? and is there a specific source of your information (aside from the docs) I can read to learn more? – Harry de winton Aug 11 '17 at 14:07
  • @Harrydewinton the [source](https://github.com/python/cpython/blob/master/Lib/multiprocessing/managers.py) will never lie. tls is [transport layer security](https://en.wikipedia.org/wiki/Transport_Layer_Security). It's a communication protocol to communicate over sockets (which can be inter-process or forwarded over a physical layer like ethernet) – Aaron Aug 11 '17 at 14:10
  • @Harrydewinton take a look at the example at the end of [this](https://docs.python.org/3.6/library/multiprocessing.html#module-multiprocessing.sharedctypes) section. It shows how you could use a `ctypes.Structure` to give your `multiprocessing.sharedctypes.Array` named values to make working with the numbers a little easier (instead of just remembering array offsets) – Aaron Aug 11 '17 at 14:13
  • managers are actually really cool, and allow you to do stuff like set up a compute cluster with whatever computers you can install python on. If you're sticking to one physical machine (and moreover one parent process) I'd stick to actual shared memory. – Aaron Aug 11 '17 at 14:16
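
The `ctypes.Structure` idea from the comments above could be sketched roughly as follows. The `Sale` record and its `price` and `year` fields are hypothetical, since the actual columns of the data are not given, and `lock=False` assumes the array is only read after it has been filled.

```python
import ctypes
from multiprocessing.sharedctypes import Array


class Sale(ctypes.Structure):
    # Hypothetical record layout, chosen only for illustration.
    _fields_ = [("price", ctypes.c_double), ("year", ctypes.c_int)]


# lock=False returns the raw shared array without a synchronizing wrapper.
records = Array(
    Sale,
    [(250000.0, 2015), (310000.0, 2016), (199950.0, 2017)],
    lock=False,
)

# Fields are accessed by name instead of by remembered offsets.
expensive_years = [r.year for r in records if r.price > 200000.0]
print(expensive_years)
```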