The numpy module is an excellent tool for memory-efficient storage of Python objects, strings among them. For ASCII strings stored in a numpy array with a bytes dtype, only one byte per character is used.
However, there is one inconvenience: the type of the stored objects is no longer str but bytes, so in most cases they have to be decoded before further use, which makes for quite bulky code:
>>> import numpy
>>> my_array = numpy.array(['apple', 'pear'], dtype='S5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an b'apple' and a b'pear'
>>> print("Mary has an {} and a {}".format(my_array[0].decode('utf-8'),
... my_array[1].decode('utf-8')))
Mary has an apple and a pear
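Incidentally, the one-byte-per-character storage claimed above can be checked directly on the byte array via the standard itemsize and nbytes attributes:
>>> my_array.itemsize
5
>>> my_array.nbytes
10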
This inconvenience can be eliminated by using another data type, e.g.:
>>> my_array = numpy.array(['apple', 'pear'], dtype='U5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an apple and a pear
However, this comes at the cost of a four-fold increase in memory usage:
>>> numpy.info(my_array)
class: ndarray
shape: (2,)
strides: (20,)
itemsize: 20
aligned: True
contiguous: True
fortran: True
data pointer: 0x1a5b020
byteorder: little
byteswap: False
type: <U5
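The factor of four can be confirmed by comparing the per-element itemsize of the two dtypes for the same data:
>>> numpy.array(['apple', 'pear'], dtype='S5').itemsize
5
>>> numpy.array(['apple', 'pear'], dtype='U5').itemsize
20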
Is there a solution for ASCII strings that combines the advantages of both: memory-efficient allocation and convenient usage?
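For illustration only, the kind of interface I am hoping for might look like the hypothetical wrapper below, which keeps the data in the compact 'S' dtype and decodes only on element access (a minimal sketch with a made-up class name, not an existing numpy feature; it handles scalar indexing only):
>>> class AsciiArray:
...     """Keep strings in 1-byte 'S' storage, decode on access."""
...     def __init__(self, strings, width):
...         # underlying storage stays at 1 byte per character
...         self._data = numpy.array(strings, dtype='S{}'.format(width))
...     def __getitem__(self, index):
...         # decode a single element to str on demand
...         return self._data[index].decode('ascii')
...
>>> my_array = AsciiArray(['apple', 'pear'], 5)
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an apple and a pear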