
It seems that putting a list of strings into a NumPy array takes over 20 times more memory than the plain Python list. I could understand it taking 10% more memory due to some overhead, but I would like to know why it takes 2000% more.

import numpy as np
from sys import getsizeof

txt = ["adsfjwofj owejifowijefiwjfoi of wofjwoijfwoijfoiwej"]
print(getsizeof(txt))

txts = [txt for _ in range(10000)]
print(getsizeof(txts))

txts_np = np.array(txts)
print(getsizeof(txts_np))

The output:

72
87624
2040112

I thought something was wrong with my installation, but I also tried it on another machine with a different NumPy version and got the same result.

toto2
  • `getsizeof` isn't recursive; lists don't contain the actual objects, they contain pointers to them. A fairer comparison would be: `sum(getsizeof(x) for x in txts)`. I am not sure how `__sizeof__` works in the case of ndarrays though, if it's implemented. – Ashwini Chaudhary Aug 01 '18 at 17:42
  • https://stackoverflow.com/questions/14208410/deep-version-of-sys-getsizeof – Fred Aug 01 '18 at 17:46

1 Answer


This is a self-answer since it was answered in the comments by @Ashwini Chaudhary.

My observation that NumPy takes much more memory than a plain list is not valid. sys.getsizeof is not a good tool for comparing memory usage here: it correctly reports the size of the NumPy array, which stores every element inline as a fixed-width unicode string (51 characters × 4 bytes = 204 bytes each, hence about 2 MB for 10000 entries), but for the list it only counts the internal pointer array, not the objects the pointers refer to.
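
A rough sketch of that accounting, reusing the names from the question (the exact byte counts may vary slightly across Python builds and NumPy versions):

import numpy as np
from sys import getsizeof

txt = ["adsfjwofj owejifowijefiwjfoi of wofjwoijfwoijfoiwej"]
txts = [txt for _ in range(10000)]
txts_np = np.array(txts)

# For the list, getsizeof counts only the internal pointer array
# (8 bytes per entry on a 64-bit build), not the referenced objects.
print(getsizeof(txts))                  # ~87624

# Counting the referenced objects as well; note that every entry here
# points at the same inner list, so the true payload is small.
print(sum(getsizeof(x) for x in txts))  # 10000 * 72 on the machine above

# The NumPy array stores each element inline as a fixed-width unicode
# string: dtype '<U51', i.e. 51 characters * 4 bytes = 204 bytes each.
print(txts_np.dtype)                    # <U51
print(txts_np.nbytes)                   # 2040000 (10000 * 204)
print(getsizeof(txts_np))               # ~2040112 (data plus array header)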

toto2