
This question is derived from here.

I have three large lists containing Python objects (l1, l2 and l3). These lists are created when the program starts, and together they take 16 GB of RAM. The program will be used exclusively on Linux.

I do not need to modify these lists, or the objects in them, in any way after they are created. They must remain in memory until the program exits.

I am using os.fork() and the multiprocessing module to spawn multiple sub-processes (currently up to 20). Each of these sub-processes needs to be able to read the three lists (l1, l2 and l3).

My program is otherwise working fine and quite fast. However, I am having problems with memory consumption. I was hoping that each sub-process could use the three lists without copying them, thanks to Linux's copy-on-write approach. This is not the case, though: referencing any object in any of these lists increases its reference count, and the write to the refcount field causes the entire page of memory it lives on to be copied.
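The effect described above can be seen directly with `sys.getrefcount`: a minimal sketch (using a throwaway object, not the actual 16 GB lists) showing that merely reading a list element creates a new reference, which mutates the count stored in the object's header and so dirties its memory page.

```python
import sys

obj = object()
l1 = [obj]

# getrefcount itself holds one temporary reference while it runs,
# but that temporary is present in both measurements, so it cancels out.
before = sys.getrefcount(obj)
x = l1[0]                        # a read-only access, yet it adds a reference
after = sys.getrefcount(obj)

print(after - before)            # 1 -- the read alone changed the object's header
```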

So my question would be:

Can I disable reference counting for l1, l2 and l3 and all of the objects in these lists? Basically, can I make each entire object (including meta-data such as the reference count) read-only, so that it will never be modified under any circumstances? This, I assume, would allow me to take advantage of copy-on-write.

Currently I fear that I am forced to move to another programming language to accomplish this task, because of a "feature" (reference counting) that I do not need, but which is still forced upon me and causes unnecessary problems.

FableBlaze
  • From your other question, I see that you have bitarrays and arrays of integers taking a lot of space. I would start by not using raw CPython objects to describe them, since that adds a lot of overhead. I would expect numpy to have a better chance of alleviating this issue; since you don't mention numpy at all, I assume you simply didn't look into it. But doesn't http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html solve most of your problems here (this supposes you also have this data on disk)? – mmgp Jan 03 '13 at 12:55
  • Going to look into it. But wouldn't constantly reading data from a file on the hard drive be a lot slower than reading the same data from memory? Furthermore, multiple processes will probably need to read the same data at the same time at some point (in my case it would happen quite often). Wouldn't this mean that some processes just wait while another one is reading? – FableBlaze Jan 03 '13 at 14:26
  • @anti666 To the CPU, memory-mapped files look like paged-out RAM: it tries to read a memory page, fails with a page fault and pages in the data (i.e. one read operation). Then the contents of the mmapped file stay in RAM until the system needs the memory for something more worthwhile. See http://duartes.org/gustavo/blog/post/page-cache-the-affair-between-memory-and-files – nd. Jan 03 '13 at 15:02
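The memmap approach suggested in the comments above can be sketched as follows. This is a hypothetical illustration (the file name and array size are made up): the data lives in a file, and every forked worker maps the same file read-only, so the OS page cache shares the physical pages between processes instead of copying them.

```python
import numpy as np

# Write the data out once, before forking (hypothetical file name).
data = np.arange(1_000_000, dtype=np.int64)
data.tofile("shared_data.bin")

# Each process then opens a read-only, memory-mapped view onto the file.
# Pages are faulted in on first access and shared via the page cache.
view = np.memmap("shared_data.bin", dtype=np.int64, mode="r",
                 shape=(1_000_000,))

total = int(view[:100].sum())
print(total)  # 4950, the sum of 0..99
```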

1 Answer


You can't; reference counting is fundamental to CPython (the reference implementation, and the one you are using). Calling methods on objects causes reference counts to change, and item subscription or attribute access causes objects to be pushed onto and popped off the stack, which uses reference counts, etc. You cannot get around this.

And if the contents of the lists don't change, use tuple()s instead. That won't change the fact that they'll be refcounted, though.

Other implementations of Python (Jython, which runs on the Java virtual machine; IronPython, a .NET runtime language; or PyPy, Python implemented in Python, experimenting with JIT and other compiler techniques) are free to use different methods of memory management, and may or may not solve your memory problem.

Martijn Pieters
    Refcounting is fundamental to CPython. Other Pythons don't have it at all (they usually use mark & sweep, which is usually also applied to all objects, but that's a different issue). –  Jan 03 '13 at 12:46
  • I thought `weakref` was for cases like this - getting a reference to an object without increasing the refcount. Am I wrong on this? – l4mpi Jan 03 '13 at 12:49
    @l4mpi: nope, you are not wrong. That one weakref won't increase the ref count. But the interpreter itself will still use ref counting. – Martijn Pieters Jan 03 '13 at 12:50
  • @delnan: quite right, but that's what the OP is using in any case. – Martijn Pieters Jan 03 '13 at 12:51
  • Yeah, but the OP probably *could* use another implementation as well (at least there's no mention of Numpy or the like). I don't actually believe this would resolve the issue, but either you argue that it's only a problem for CPython (which begs the question about other implementations), or you extend the argument to all implementations. –  Jan 03 '13 at 12:53
  • @delnan: problem is, I never use Jython or IronPython or PyPy, so I feel ill equipped to make any recommendations about using those. For CPython, the answer to the OP question 'can I turn off reference counting' is a very firm "No". – Martijn Pieters Jan 03 '13 at 13:57