
It seems that putting a list of strings into a NumPy array takes over 20 times more memory than the plain Python list. I could understand it taking 10% more memory due to some overhead, but I would like to know why it takes 2000% more.

import numpy as np
from sys import getsizeof

txt = ["adsfjwofj owejifowijefiwjfoi of wofjwoijfwoijfoiwej"]
print(getsizeof(txt))

txts = [txt for _ in range(10000)]
print(getsizeof(txts))

txts_np = np.array(txts)
print(getsizeof(txts_np))

The output:

72
87624
2040112

I thought something was wrong with my installation, but I also tried it on another machine with a different NumPy version and got the same result.

toto2
  • `getsizeof` isn't recursive; lists don't contain the actual objects, they contain pointers to them. A fairer comparison would be: `sum(getsizeof(x) for x in txts)`. I am not sure how `__sizeof__` works in the case of ndarrays though, if it's implemented. – Ashwini Chaudhary Aug 01 '18 at 17:42
  • https://stackoverflow.com/questions/14208410/deep-version-of-sys-getsizeof – Fred Aug 01 '18 at 17:46

1 Answer


This is a self-answer since it was answered in the comments by @Ashwini Chaudhary.

My observation that NumPy takes much more memory than a plain list is not valid. sys.getsizeof is not a good tool for comparing memory usage here: it correctly reports the size of the NumPy array, which stores every element inline as a fixed-width unicode string (51 characters × 4 bytes = 204 bytes each, hence about 2 MB for 10000 entries), but for the list it only counts the internal pointer array, not the objects the pointers refer to.
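
A rough sketch of that accounting, reusing the names from the question (the exact byte counts may vary slightly across Python builds and NumPy versions):

import numpy as np
from sys import getsizeof

txt = ["adsfjwofj owejifowijefiwjfoi of wofjwoijfwoijfoiwej"]
txts = [txt for _ in range(10000)]
txts_np = np.array(txts)

# For the list, getsizeof counts only the internal pointer array
# (8 bytes per entry on a 64-bit build), not the referenced objects.
print(getsizeof(txts))                  # ~87624

# Counting the referenced objects as well; note that every entry here
# points at the same inner list, so the true payload is small.
print(sum(getsizeof(x) for x in txts))  # 10000 * 72 on the machine above

# The NumPy array stores each element inline as a fixed-width unicode
# string: dtype '<U51', i.e. 51 characters * 4 bytes = 204 bytes each.
print(txts_np.dtype)                    # <U51
print(txts_np.nbytes)                   # 2040000 (10000 * 204)
print(getsizeof(txts_np))               # ~2040112 (data plus array header)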

toto2