
I'm working on a project in which, at some point, I am dealing with a NumPy array of shape (165L, 653L, 1024L, 1L) (around 100 MB of data).

For JSON compatibility, I need to turn it into a regular Python list, so I used the standard method:

array.tolist()

The problem is that this single line results in about 10 GB of RAM consumption. Something seems wrong here; am I not supposed to use tolist() on big arrays?

I have looked around the web a bit and found some suspicions of tolist() leaking memory, notably in "Apparent memory leak with numpy tolist() in long running process" and in https://mail.python.org/pipermail/matrix-sig/1998-October/002368.html, but neither seems directly related to my problem.

Skum
  • Usually because Python constructs **objects** for floats and ints. So this results in rather large memory usage. – Willem Van Onsem Mar 30 '17 at 15:22
  • @WillemVanOnsem Thanks for your answer. I suspected something along those lines, but 10GB still seems like a huge amount. Do you know if there is a way to turn the array into a list in a more efficient way? – Skum Mar 30 '17 at 15:24
  • Well, an int/float in Python 2.7 takes 24 bytes each. Your array is 110,330,880 elements large, so a quick guess would be that the objects alone take at least ~2.5 GiB. Furthermore, the list itself also uses memory, usually 8 bytes per element, so it costs at least ~3.5 GiB to store that data. – Willem Van Onsem Mar 30 '17 at 15:27
  • Try accessing parts of your large array to deal with your JSON issue. Also, you could use numpy.memmap to free up some RAM. This will map your numpy array to disk. – ma3oun Mar 30 '17 at 15:28
  • Flattening the array will reduce the number of intermediate pointers. That last dimension of size 1 needlessly adds a layer of indirection for every element of the array. – hpaulj Mar 30 '17 at 15:52
  • @Schmuddi: _"The function getsize reveals that a np.int64 object consumes 1968 bytes, and np.float64 consumes 2091 bytes in Python 2.7"_ - that doesn't sound right at all. For me, on 3.5, `[sys.getsizeof(t(0)) for t in [np.float64, np.int64]]` gives `[32, 32]`. – Eric Mar 30 '17 at 16:08
  • @Eric: You're right, and I'm stupid: the *class definitions* of `np.float64` and `np.int64` consume that many bytes, but not the instances. Thanks for spotting that embarrassing mistake. – Schmuddi Mar 30 '17 at 16:14
  • Just to add to the consensus: Since `.tolist` (unlike `list()`) listifies recursively and your last dimension is 1, you have more lists than numbers in your list of lists of lists of lists. A list storing a single element takes 80 bytes in Py2, so the memory consumption you are seeing is entirely plausible. – Paul Panzer Mar 30 '17 at 17:04
  • Your real issue here is that you're trying to serialize the whole thing to JSON in memory. Even if you didn't go via a `list`, the output JSON is going to be pretty huge. You need to find a streaming JSON encoder that allows you to customize handling `nd.array`. Unfortunately, `json.JSONEncoder._default` isn't useful here. – Eric Mar 30 '17 at 21:49
  • Also, perhaps your array is sparse, which is why it occupies so "little" space – BlackBear Sep 11 '17 at 11:50
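
The arithmetic in the comments above is easy to check with a quick sketch (exact byte counts depend on the Python version and platform):

```python
import sys
import numpy as np

print(165 * 653 * 1024 * 1)            # 110330880 elements in total

print(sys.getsizeof(1.0))              # a plain Python float object: 24 bytes on 64-bit CPython
print(sys.getsizeof(np.float64(1.0)))  # a NumPy scalar instance: 32 bytes
print(sys.getsizeof([1.0]))            # a one-element list: ~80 bytes on Python 2.7, ~64 on Python 3

# tolist() builds one Python float per element, and because the trailing
# dimension is 1 it also builds one single-element list per element, which is
# where most of the blow-up beyond the raw ~100 MB comes from.
```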
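A minimal sketch of the numpy.memmap suggestion from the comments, with an illustrative file name and dtype (note that mode='w+' creates a roughly 440 MB file for this shape with float32):

```python
import numpy as np

# The data lives on disk in 'data.dat'; only the pages actually touched are
# paged into RAM, so slabs can be read back without holding the whole array.
shape = (165, 653, 1024, 1)
mm = np.memmap('data.dat', dtype=np.float32, mode='w+', shape=shape)

mm[0] = 1.0              # writes go to the file-backed buffer
mm.flush()

slab = mm[0, :, :, 0]    # a single 653 x 1024 slab
print(slab.shape)        # (653, 1024)
```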

1 Answer


Instead of trying to convert the entire 165 x 653 x 1024 x 1 array to a list so that you can turn around and convert it to JSON, just do 165 separate conversions of the 653 x 1024 inner slices to lists, and write them to a file with your own explicit separators.
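
A minimal sketch of what this could look like, using a zero-filled stand-in for the real data and an illustrative output file name:

```python
import json
import numpy as np

arr = np.zeros((165, 653, 1024, 1), dtype=np.float32)  # stand-in for the real array

with open('out.json', 'w') as f:
    f.write('[')
    for i in range(arr.shape[0]):
        if i:
            f.write(',')
        slab = arr[i, :, :, 0]        # one 653 x 1024 slice; drops the size-1 axis
        json.dump(slab.tolist(), f)   # only this slab's Python objects exist at a time
    f.write(']')
```

If the original four-level nesting has to be preserved, `arr[i].tolist()` works just as well, at the cost of the extra single-element lists discussed in the comments above.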

John Zwinck
  • Thanks, the problem was actually solved a while ago and this thread got bumped because I edited a typo (didn't know it would bump it). But your way of doing it would work so I'll confirm it for future users. – Skum Sep 11 '17 at 11:59