
I am still new to NumPy and was messing around with NumPy's dtypes when I found that the dtype specific to strings, 'U', uses up more memory than the object dtype. The code below illustrates this:

import numpy as np

size = 100000
half_size = size//2

ind1 = np.arange(half_size)*2+1
ind2 = np.arange(half_size)*2

X = np.empty(size, dtype = 'object')

X[ind1] = 'smile'
X[ind2] = 'smile2'

W = np.empty(size, dtype = 'U6')
W[ind1] = 'smile'
W[ind2] = 'smile2'

print(X.nbytes)
print(W.nbytes)

The result is the following:

800000
2400000

My questions are the following:

1) Why does this happen? Why does dtype = 'U6' take up three times as much memory as dtype = object?

2) Is there a way to create a numpy string array that takes up less memory than dtype = object?

Thank you in advance

EDIT: I'd like to explain that my post is not a duplicate of another post, because my post is about memory usage, and the other post does not mention anything about memory usage regarding dtype = 'U' vs dtype = 'object'.

EDIT2: Although I have already learnt something new from another post, unfortunately the other post does not answer my question, because my post is about memory usage, and the other post does not mention anything about memory usage regarding dtype = 'U' vs dtype = 'object'.

mathguy
  • Possible duplicate of [What does dtype=object mean while creating a numpy array?](https://stackoverflow.com/questions/29877508/what-does-dtype-object-mean-while-creating-a-numpy-array) – ikkuh Jun 27 '19 at 11:40
  • @ikkuh my question is mostly about memory usage, how is a duplicate post? Your post doesn't delve into the comparisons between dtype = 'U' vs dtype = 'object'. – mathguy Jun 27 '19 at 11:41
  • The answer explains that the 'object' array only stores pointers. So the 'U6' array doesn't take 3 times as much memory. – ikkuh Jun 27 '19 at 11:47
  • The post does not mention how 'the 'object' array only stores pointers' leads to less memory usage than the 'U6' type. – mathguy Jun 27 '19 at 11:50
  • A pointer takes 8 bytes in memory, whereas a unicode string of length 6 takes 24 bytes. This is the reason for the difference in memory for the numpy array. However, pointers don't store the data for the strings: the actual string data for the object array is somewhere else in memory. – ikkuh Jun 27 '19 at 11:59
  • @ikkuh that's the explanation I am looking for. It clearly addresses the memory usage difference between dtype=object and dtype = 'U6'. – mathguy Jun 27 '19 at 12:03
  • `nbytes` is just array memory usage, not total. It doesn't account for the memory used by the Python strings. In this case that will be small. But in general the 'U' dtype string storage is not memory efficient. There's too much padding. – hpaulj Jun 27 '19 at 15:20

2 Answers


sys.getsizeof is one way of checking memory usage, though you have to use it wisely, understanding exactly what it is measuring. For arrays it works fairly well.

An array without any elements:

In [28]: sys.getsizeof(np.array([],'U6'))                                                            
Out[28]: 96
In [29]: sys.getsizeof(np.array(['smile','smile1'],'U6'))                                            
Out[29]: 144
In [30]: sys.getsizeof(np.array(['smile','smile1'],'S6'))                                            
Out[30]: 108

With 2 'U6' strings, the size jumps by 48: 4 bytes/char * 2 elements * 6 chars per element.

With a bytestring dtype (the default string dtype for Py2), the jump is 12: 1 byte/char * 2 elements * 6 chars.

The bytestring dtype is more compact, but note the display:

In [31]: np.array(['smile','smile1'],'S6')                                                           
Out[31]: array([b'smile', b'smile1'], dtype='|S6')
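A small sanity check of the arithmetic above: the per-element sizes come straight from dtype.itemsize, so the getsizeof jumps of 48 and 12 can be predicted without allocating anything.

```python
import numpy as np

# 'U6' stores 4 bytes per character, 'S6' stores 1 byte per character.
print(np.dtype('U6').itemsize)  # 24 = 4 * 6
print(np.dtype('S6').itemsize)  # 6  = 1 * 6

# Two elements therefore add 2 * 24 = 48 bytes ('U6') or 2 * 6 = 12 bytes
# ('S6') on top of the ~96-byte empty-array overhead measured above.
```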

For object dtype:

In [32]: sys.getsizeof(np.array(['smile','smile1'],object))                                          
Out[32]: 112

That's 16 bytes over the empty array: 2 elements * 8 bytes per pointer.

But add to that the size of the Python strings themselves, an extra 133 bytes:

In [33]: sys.getsizeof('smile')                                                                      
Out[33]: 78
In [34]: sys.getsizeof('smile1')                                                                     
Out[34]: 55

and for bytestrings:

In [36]: sys.getsizeof(b'smile')                                                                     
Out[36]: 38
In [37]: sys.getsizeof(b'smile1')                                                                    
Out[37]: 39

Note that when I add a byte character, the size increases by 1. But when I add a unicode character, the size actually decreases. The size of unicode strings is harder to predict. I think Python can allocate up to 4 bytes per char, but the actual number depends on the characters and the encoding. Usually we don't try to micro-manage Python's string handling. (On top of that, I believe Python has some sort of string cache.)

But when you assign

X[ind1] = 'smile'
X[ind2] = 'smile2'

in the object case you create two Python strings and assign references (pointers) to them into the array. So memory usage is that of the array (1000... * 8 bytes) plus the 133 bytes for those 2 strings.
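A rough sketch of that accounting (assuming a 64-bit build, where each object pointer is 8 bytes; the exact getsizeof figures vary by Python version):

```python
import sys
import numpy as np

size = 100000
X = np.empty(size, dtype=object)
X[1::2] = 'smile'
X[::2] = 'smile2'

# The array itself holds one 8-byte pointer per element ...
print(X.nbytes)  # 800000 = 100000 * 8

# ... while the two Python string objects live elsewhere on the heap,
# shared by every element that references them.
extra = sys.getsizeof('smile') + sys.getsizeof('smile2')
print(X.nbytes + extra)  # only slightly more than the pointer array alone
```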

In the 'U6' case each element takes up 4*6 bytes, regardless of whether it is 'smile' or 'smile1' (or just 's'). Every element of the array uses the same space, whether or not all of it is needed to represent the string.
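The padding is easy to see: even a one-character string occupies the full 24-byte slot.

```python
import numpy as np

# Each 'U6' element is 24 bytes, whether it holds 's' or 'smile1':
W = np.array(['s', 'smile1'], dtype='U6')
print(W.nbytes)  # 48 = 2 elements * 24 bytes
```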

In general, strings are not a numpy strength. Memory usage of the 'U' or 'S' dtypes is OK when the strings have similar sizes, but less optimal when the strings vary in length, are repeated, and/or are unicode. numpy doesn't do much of its own string processing; the np.char functions are just thin wrappers around the Python string methods.
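For instance, the np.char functions just apply the corresponding str method element by element, so there is no vectorized speedup to be had:

```python
import numpy as np

a = np.array(['smile', 'smile1'], dtype='U6')

# Each call loops over the array, invoking the Python string method
# on every element:
print(np.char.upper(a))    # ['SMILE' 'SMILE1']
print(np.char.str_len(a))  # [5 6]
```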

pandas has chosen to use the object dtype instead of the string dtypes.

hpaulj

If you check the size of each dtype in memory you get:

import numpy as np

dt = np.dtype('object')
print('object = %i bytes' % dt.itemsize)

dt = np.dtype('U6')
print('U6 = %i bytes' % dt.itemsize)

Output:

object = 8 bytes
U6 = 24 bytes
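These per-element sizes predict the exact nbytes figures from the question (assuming a 64-bit build, where an object pointer is 8 bytes):

```python
import numpy as np

size = 100000
# nbytes is simply size * itemsize:
print(size * np.dtype('object').itemsize)  # 800000, matches X.nbytes
print(size * np.dtype('U6').itemsize)      # 2400000, matches W.nbytes
```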
Zaraki Kenpachi
  • Thanks for your answer. Though I wonder: if a numpy string array created using dtype = object takes up much less memory, why bother with dtype = 'U6'? – mathguy Jun 27 '19 at 12:08
  • @mathguy see this: https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.dtypes.html – Zaraki Kenpachi Jun 27 '19 at 12:11
  • I haven't found any dtype better than dtype = object; guess I'll stick with it whenever I need to initialize a string array from now on. – mathguy Jun 27 '19 at 12:30
  • Object dtype arrays don't offer many advantages compared to lists, especially when they are 1d. – hpaulj Jun 27 '19 at 16:02