Numpy String Encoding

Question

The numpy module is an excellent tool for memory-efficient storage of Python data, strings included. For ANSI (single-byte) strings stored with the 'S' dtype, numpy arrays use only 1 byte per character.

However, there is one inconvenience. The stored objects are no longer of type str but bytes, so in most cases they have to be decoded before further use, which makes for quite bulky code:

>>> import numpy
>>> my_array = numpy.array(['apple', 'pear'], dtype = 'S5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an b'apple' and a b'pear'
>>> print("Mary has an {} and a {}".format(my_array[0].decode('utf-8'),
... my_array[1].decode('utf-8')))
Mary has an apple and a pear

This inconvenience can be eliminated by using another data type, e.g.:

>>> my_array = numpy.array(['apple', 'pear'], dtype = 'U5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an apple and a pear

However, this comes at the cost of a 4-fold increase in memory usage:

>>> numpy.info(my_array)
class:  ndarray
shape:  (2,)
strides:  (20,)
itemsize:  20
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0x1a5b020
byteorder:  little
byteswap:  False
type: <U5
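The 4-fold figure can be checked directly by comparing the two dtypes side by side. This is a minimal sketch; the names `byte_array` and `text_array` are illustrative and do not appear in the session above.

```python
import numpy

words = ['apple', 'pear']

# 'S5': fixed-width byte strings, 1 byte per character
byte_array = numpy.array(words, dtype='S5')
# 'U5': fixed-width unicode strings, 4 bytes per character
text_array = numpy.array(words, dtype='U5')

# itemsize is the size of one element, nbytes the size of the whole buffer
print(byte_array.itemsize, byte_array.nbytes)   # 5 10
print(text_array.itemsize, text_array.nbytes)   # 20 40
```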

Is there a solution for ANSI strings that combines both advantages: efficient memory allocation and convenient usage?

This is a Python3 issue, which displays byte strings with the `b`. — hpaulj, Aug 25 '15 at 15:30

score 4 · Accepted Answer · edited May 23 '17 at 10:29

It's not a big improvement over decode, but astype works (and can be applied to the whole array rather than to each string). The larger unicode copy, however, stays in memory for as long as it is referenced.

In [538]: x=my_array.astype('U');"Mary has an {} and a {}".format(x[0],x[1])
Out[538]: 'Mary has an apple and a pear'

I can't find anything in the format syntax that would force 'b'-less formatting.

https://stackoverflow.com/a/19864787/901925 shows how to customize the Formatter class by changing the format_field method. I tried something similar with the convert_field method, but the calling syntax is still messy.

In [562]: import string
   .....: def makeU(astr):
   .....:     return astr.decode('utf-8')
   .....: 

In [563]: class MyFormatter(string.Formatter):
   .....:     def convert_field(self, value, conversion):
   .....:         if conversion == 'q':
   .....:             return makeU(value)
   .....:         else:
   .....:             return super(MyFormatter, self).convert_field(value, conversion)
   .....: 

In [564]: MyFormatter().format("Mary has an {!q} and a {!q}",my_array[0],my_array[1])
Out[564]: 'Mary has an apple and a pear'

A couple of other ways of doing this formatting:

In [642]: "Mary has an {1} and a {0} or {1}".format(*my_array.astype('U'))
Out[642]: 'Mary has an pear and a apple or pear'

This converts the array on the fly and unpacks its elements as positional arguments to format. It also works if the array is already unicode (uarray here is the 'U5' version of the array):

In [643]: "Mary has an {1} and a {0} or {1}".format(*uarray.astype('U'))
Out[643]: 'Mary has an pear and a apple or pear'

np.char has functions that apply string functions to elements of a character array. With this decode can be applied to the whole array:

In [644]: "Mary has a {1} and an {0}".format(*np.char.decode(my_array))
Out[644]: 'Mary has a pear and an apple'

(this doesn't work if the array is already unicode).
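To handle both cases with one call, the dtype can be checked before decoding. The following is only a sketch: `to_text` is a hypothetical helper, not part of NumPy, and `uarray` is assumed to be the 'U5' version of the array.

```python
import numpy as np

def to_text(arr, encoding='utf-8'):
    """Hypothetical helper: return a unicode ('U') array either way."""
    # dtype.kind is 'S' for byte strings and 'U' for unicode strings
    if arr.dtype.kind == 'S':
        return np.char.decode(arr, encoding)
    return arr  # already unicode, nothing to decode

my_array = np.array(['apple', 'pear'], dtype='S5')
uarray = np.array(['apple', 'pear'], dtype='U5')

print("Mary has a {1} and an {0}".format(*to_text(my_array)))
print("Mary has a {1} and an {0}".format(*to_text(uarray)))
```

Both calls print 'Mary has a pear and an apple'. Only the bytes array is copied; the unicode array is passed through unchanged.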

If you do much with string arrays, np.char is worth a study.

Thank you for the profound answer. As I need not only format strings, but also pass single array elements to functions, I opted for making the function: `def U(astr): return astr.decode('utf-8')`, as it requires minimum additional symbols. It is also the most obvious solution. — Roman, Aug 26 '15 at 06:55

dawg · Answer 2 · 2015-08-25T16:32:08.840

Given:

>>> my_array = numpy.array(['apple', 'pear'], dtype = 'S5')

You can decode on the fly:

>>> print("Mary has an {} and a {}".format(*map(lambda b: b.decode('utf-8'), my_array)))
Mary has an apple and a pear

Or you can create a specific formatter:

import string
class ByteFormatter(string.Formatter):
    def __init__(self, decoder='utf-8'):
        self.decoder=decoder

    def format_field(self, value, spec):
        if isinstance(value, bytes):
            return value.decode(self.decoder)
        return super(ByteFormatter, self).format_field(value, spec)   

>>> print(ByteFormatter().format("Mary has an {} and a {}", *my_array))
Mary has an apple and a pear
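Note that the format_field above returns the decoded value without applying the format spec, so widths and alignment are silently ignored for bytes. A possible variant, sketched here under the assumption that the spec should apply to the decoded text, decodes first and then delegates to the base class. The name `SpecAwareByteFormatter` is made up for illustration.

```python
import string

class SpecAwareByteFormatter(string.Formatter):
    """Illustrative variant: decode bytes, then apply the format spec."""

    def __init__(self, encoding='utf-8'):
        super().__init__()
        self.encoding = encoding

    def format_field(self, value, format_spec):
        # numpy.bytes_ subclasses bytes, so array elements match here too
        if isinstance(value, bytes):
            value = value.decode(self.encoding)
        return super().format_field(value, format_spec)

fmt = SpecAwareByteFormatter()
# Plain bytes literals stand in for elements of an 'S5' array
print(fmt.format("[{:>7}] [{:<7}]", b'apple', b'pear'))
```

This prints '[  apple] [pear   ]', with the alignment and width applied to the decoded text.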
