I'm trying to write a huge amount of pickled data to disk in small pieces. Here is the example code:

from cPickle import dumps
from gc import collect

PATH = r'd:\test.dat'

# @profile is provided by memory_profiler when the script is run
# with `python -m memory_profiler`
@profile
def func(item):
    for e in item:
        f = open(PATH, 'a', 0)  # unbuffered append
        f.write(dumps(e))
        f.flush()
        f.close()
        del f
        collect()

if __name__ == '__main__':
    k = [x for x in xrange(9999)]
    func(k)

open() and close() are placed inside the loop to rule out file buffering as a possible cause of data accumulating in memory.

To illustrate the problem, I attach the results of memory profiling obtained with the third-party Python module memory_profiler:

   Line #    Mem usage  Increment   Line Contents
==============================================
    14                           @profile
    15      9.02 MB    0.00 MB   def func(item):
    16      9.02 MB    0.00 MB       path= r'd:\test.dat'
    17
    18     10.88 MB    1.86 MB       for e in item:
    19     10.88 MB    0.00 MB           f = open(path, 'a', 0)
    20     10.88 MB    0.00 MB           f.write(dumps(e))
    21     10.88 MB    0.00 MB           f.flush()
    22     10.88 MB    0.00 MB           f.close()
    23     10.88 MB    0.00 MB           del f
    24                                   collect()

During execution of the loop, strange memory usage growth occurs. How can it be eliminated? Any thoughts?

When the amount of input data increases, the volume of this additional memory can grow much larger than the input itself (upd: in my real task I get 300+ MB).

And a broader question: what are the proper ways to work with large amounts of I/O data in Python?

upd: I rewrote the code leaving only the loop body, to see specifically where the growth happens, and here are the results:

Line #    Mem usage  Increment   Line Contents
==============================================
    14                           @profile
    15      9.00 MB    0.00 MB   def func(item):
    16      9.00 MB    0.00 MB       path= r'd:\test.dat'
    17
    18                               #for e in item:
    19      9.02 MB    0.02 MB       f = open(path, 'a', 0)
    20      9.23 MB    0.21 MB       d = dumps(item)
    21      9.23 MB    0.00 MB       f.write(d)
    22      9.23 MB    0.00 MB       f.flush()
    23      9.23 MB    0.00 MB       f.close()
    24      9.23 MB    0.00 MB       del f
    25      9.23 MB    0.00 MB       collect()

It seems that dumps() eats the memory. (I actually expected it to be write().)
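For reference, a variant that avoids building the intermediate string at all would pickle each element straight into one open file with dump() instead of dumps(). This is just a sketch, not measured here; the function name is made up, and as above @profile comes from memory_profiler:

from cPickle import dump, HIGHEST_PROTOCOL

PATH = r'd:\test.dat'

@profile
def func_dump(item):
    # one file handle, no intermediate pickled string:
    # each element is serialized directly into the file
    f = open(PATH, 'wb')
    for e in item:
        dump(e, f, HIGHEST_PROTOCOL)
    f.close()

if __name__ == '__main__':
    func_dump([x for x in xrange(9999)])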

Gill Bates
  • First, you're only at 11MB. Are you sure there's a real problem? Have you actually _tried_ it with large amounts of data to see if it increases linearly to some scary level? Second, the increment happens on the `for` loop (so presumably inside `item.__next__`), not the `dumps` line. (And if you _do_ think it's the pickling, why haven't you tried splitting `dumps` and `write` into separate steps?) – abarnert Dec 14 '12 at 00:59
  • Also, `memory_profiler` says it "gets the memory consumption by querying the operating system kernel about the amount of memory the current process has allocated, which might be slightly different from the amount of memory that is actually used by the Python interpreter". In fact, it may be way, way different! Just because Python calls `free` doesn't necessarily mean the platform's allocator releases it all immediately to the OS—in fact, it's perfectly reasonable for it to hold onto the page mappings and never release them. – abarnert Dec 14 '12 at 01:02
  • Why do you keep opening and closing the output file? It seems like it would be a lot more efficient to leave it open for the whole loop. I doubt that has anything to do with your supposed memory usage growth. Does the memory usage keep getting bigger and bigger, or is there just that one jump shown in your question? – martineau Dec 14 '12 at 01:03
  • For your "wide question": It depends on how big you mean by big. But the two basic strategies are: don't use that much (e.g., use a `numpy` array of `int`s instead of a nested `list` of `list`s of Python objects), or use a database (`anydbm` or `sqlite3`) instead of building a giant in-memory store and persisting it to disk en masse. – abarnert Dec 14 '12 at 01:08
  • @abarnert I used a small input to get the test result faster. In my real task I'm getting 300+ MB. About when the increment happens - it seems like the profiling tool shows the whole increment during the loop against the line where the loop starts. I actually don't think that pickle eats the memory, maybe that's unclear in the text of my post. I think the growth happens somewhere in the file IO part. If the code is rewritten without the loop, it can be seen that memory usage increases at write(). I will update the post – Gill Bates Dec 14 '12 at 01:09
  • How much data do you actually have in your real task? – abarnert Dec 14 '12 at 01:11
  • @abarnert Less than 100 MB, this is just a test sample size. – Gill Bates Dec 14 '12 at 01:22
  • @martineau I've placed the opening and closing statements inside the loop to be sure that file buffers of some kind don't accumulate the memory – Gill Bates Dec 14 '12 at 01:26
  • Check out [streaming-pickle](https://code.google.com/p/streaming-pickle/) which supposedly would use a lot less memory for what you're doing. – martineau Dec 14 '12 at 01:34
  • Do you really need to have the real data all in memory, and serialize it all at once, or could you use, e.g., `shelve` and let it worry about persistence for you? Or, alternatively, if your data can be broken into independent pieces, serialize them independently? – abarnert Dec 14 '12 at 02:00
  • @abarnert: Your comment makes me wonder if the issue is actually the reverse, that interconnected objects are being pickled separately. @GilBates: what are the types of the items you're `pickle`ing? Do they have references to each other, or to some other objects? If your items are themselves small but reference some common large object the results you're getting would not be too surprising, since each of the calls to `dumps` would be repickling that large object. Also, have you tried other pickle protocols? The newer ones are designed to be more space efficient. – Blckknght Dec 14 '12 at 04:26
  • @martineau @abarnert Actually what I'm trying to do is write a small 'serial' pickle module which could operate on huge data while storing only a small chunk of it in RAM. For example, a program does lookups through 1 GB of data downloaded from a DB and stored in a persistent object cached on disk. In that case the persistence module would load small chunks of the object into memory as needed, and the level of memory usage would not rise above a critical level. Standard Python IO goes crazy when I use it that way, so thanks for pointing me to the `anydbm` and `sqlite3` modules, I think I will use them. – Gill Bates Dec 14 '12 at 04:31
  • BUT the question that still concerns me is - _why_ does `pickle` _accumulate memory?_ A function like `dumps()` should just pop out a pickled string which then disappears from the scene. But practice shows that frequent calls to `dumps()` cause growing memory usage. Even when DB libraries are used, `pickle` will still be used, and will eat RAM :) – Gill Bates Dec 14 '12 at 04:41
  • @Blckknght Well, such an effect can be observed just by feeding `pickle` a list of `[9999999]` integers. I've also tried different protocols - without much effect. I don't care about the volume of the pickled data, but I do care about the level of RAM used. I need to learn how to load data from disk and how to save it while _freeing memory_ in the process – Gill Bates Dec 14 '12 at 04:54
  • Have you tried dumps(e, -1)? – tdihp Dec 14 '12 at 06:09
  • @GillBates: Have you tested, e.g., just storing your data in a `shelve` to see if it actually _does_ use memory this way, instead of assuming it must because it uses `pickle`? Also, does your peak data usage actually overrun your bounds (or, if you're on 64-bit, throw you into swap thrash hell)? There are some use cases in Python that seem to be linear in space, but are actually just linear up to some constant limit after which they flatten out (by reusing that same storage). Especially if you're on a platform that doesn't usually release memory to the kernel and you're measuring from outside. – abarnert Dec 14 '12 at 20:32
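Picking up the `shelve`/`sqlite3` suggestions from the comments above, here is a minimal sketch of the `shelve` route (the file name and key scheme are made up for illustration). Records are pickled one per key into a dbm file, so only the entries you actually touch are held in memory:

import shelve

# write: one pickled record per string key
db = shelve.open(r'd:\test_shelf.db')
for i in xrange(9999):
    db[str(i)] = {'value': i}
db.close()

# read: load single records on demand instead of unpickling everything at once
db = shelve.open(r'd:\test_shelf.db')
print db['42']
db.close()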

1 Answer

Pickle consumes a lot of RAM; see the explanation here: http://www.shocksolution.com/2010/01/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/

Why does Pickle consume so much more memory? The reason is that HDF is a binary data pipe, while Pickle is an object serialization protocol. Pickle actually consists of a simple virtual machine (VM) that translates an object into a series of opcodes and writes them to disk. To unpickle something, the VM reads and interprets the opcodes and reconstructs an object. The downside of this approach is that the VM has to construct a complete copy of the object in memory before it writes it to disk.
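In other words, dumps(obj) builds the complete pickle byte string in memory before anything reaches the disk. A quick illustration with the standard pickletools module shows the opcode stream that string contains (sketch only):

import cPickle
import pickletools

data = range(5)
s = cPickle.dumps(data, 2)   # the whole pickled copy lives in memory here
print len(s)                 # size of that in-memory copy
pickletools.dis(s)           # the opcode stream the pickle VM produced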

Pickle is great for small use cases or testing because in most cases the memory consumption doesn't matter a lot.

For intensive work, where you have to dump and load a lot of files and/or big files, you should consider using another way to store your data (e.g. HDF, or write your own serialize/deserialize methods for your objects, ...).
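For example, a minimal "streaming" serializer (just a sketch; the function names and file path are made up) dumps one record at a time and loads them back lazily, so only a single record is ever held in memory:

from cPickle import dump, load, HIGHEST_PROTOCOL

def dump_records(records, path):
    # one pickle frame per record: only the current record is in memory
    with open(path, 'wb') as f:
        for rec in records:
            dump(rec, f, HIGHEST_PROTOCOL)

def load_records(path):
    # generator that yields the records back one by one until end of file
    with open(path, 'rb') as f:
        while True:
            try:
                yield load(f)
            except EOFError:
                return

# usage sketch
dump_records(({'id': i} for i in xrange(9999)), r'd:\records.dat')
for rec in load_records(r'd:\records.dat'):
    pass  # process rec here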

AMairesse
  • Does it load the data into CPU memory or GPU memory? Will it release it on its own immediately after it's dumped to the file? What I have seen is that it fills up the GPU memory and doesn't release the memory even after it's dumped – Tushar Seth Sep 27 '19 at 20:36
  • @TusharSeth I think I am facing the same problem as highlighted by the question that I asked [today](https://stackoverflow.com/questions/60432137/jupyter-notebook-memory-management). Did you manage to find a solution to this problem? – A Merii Feb 27 '20 at 14:23