Cython prange with an array of string

Question

I'm trying to use prange in order to process multiple strings. As it is not possible to do this with a python list, I'm using a numpy array.

With an array of floats, this function works:

from cython.parallel import prange
cimport numpy as np
from numpy cimport ndarray as ar

cpdef func_float(ar[np.float64_t,cast=True] x, double alpha):
    cdef int i
    for i in prange(x.shape[0], nogil=True):
        x[i] = alpha * x[i]
    return x

When I try this simple one:

cpdef func_string(ar[np.str,cast=True] x):
    cdef int i
    for i in prange(x.shape[0], nogil=True):
        x[i] = x[i] + str(i)
    return x

I'm getting this:

>> func_string(x = np.array(["apple","pear"],dtype=np.str))
  File "processing.pyx", line 8, in processing.func_string
    cpdef func_string(ar[np.str,cast=True] x):
ValueError: Item size of buffer (20 bytes) does not match size of 'str object' (8 bytes)

I'm probably missing something and I can't find an alternative to str. Is there a way to properly use prange with an array of strings?

@DavidW I want to keep prange. Changing to range has no effect. — bob koal, Mar 15 '19 at 23:02
Which Python/Cython version are you using? I'm surprised your code compiled at all, because `str(i)` should create a Python-object which shouldn't be possible without gil. — ead, Mar 16 '19 at 05:14

score 1 · Accepted Answer · answered Mar 16 '19 at 07:21

Besides the fact that your code should fail when cythonized, because you try to create a Python object (i.e. str(i)) without the GIL, your code isn't doing what you think it should do.

In order to analyse what is going on, let's take a look at a much simpler Cython version:

%%cython -2
cimport numpy as np
from numpy cimport ndarray as ar

cpdef func_string(ar[np.str, cast=True] x):
    print(len(x))

From your error message, one can deduce that you use Python 3 and that the Cython extension is built with the (still default) language_level=2, thus I'm using -2 in the %%cython-magic cell.

And now:

>>> x = np.array(["apple", "pear"], dtype=np.str)
>>> func_string(x)    
ValueError: Item size of buffer (20 bytes) does not match size of 'str object' (8 bytes)

What is going on?

x is not what you think it is

First, let's take a look at x:

>>> x.dtype
<U5

So x isn't a collection of unicode objects. Each element of x consists of 5 unicode characters, and those elements are stored contiguously in memory, one after another. What is important: it is the same information as in unicode objects, but stored in a different memory layout.

This is one of numpy's quirks and how np.array works: every element in the list is converted to a unicode object, then the maximal element length is determined, and a matching dtype (in this case <U5) is chosen and used.
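The dtype inference described above can be checked directly. This is a minimal sketch, assuming a recent NumPy where the alias is spelled np.str_ (np.str has been removed in newer releases); the variable names are only for illustration.

```python
import numpy as np

words = ["apple", "pear"]
x = np.array(words, dtype=np.str_)

# The width of the fixed-size unicode dtype is the length of the longest word.
longest = max(len(w) for w in words)

# Each character takes 4 bytes, so one element occupies 4 * longest bytes.
print(x.dtype, x.itemsize)

# Shorter elements are padded internally, but read back without the padding.
print(repr(x[1]))
```

Every element occupies the same number of bytes regardless of its own length, which is why the array cannot be treated as a collection of independent Python string objects.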

np.str is interpreted differently in cython code (ar[np.str] x) (twice!)

First difference: in your Python3-code np.str is for unicode, but in your cython code, which is cythonized with language_level=2, np.str is for bytes (see doc).

Second difference: seeing np.str, Cython will interpret it as an array of Python objects (maybe this should be seen as a Cython bug). It is almost the same as if the dtype were np.object; actually, the only difference to np.object is slightly different error messages.

With this information we can understand the error message. At runtime, the input array is checked (before the first line of the function is executed!):

expected is an array of Python objects, i.e. of 8-byte pointers, so the element size is 8 bytes
received is an array with an element size of 5*4=20 bytes (one unicode character takes 4 bytes)

thus the cast cannot be done and the observed exception is thrown.
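The two sizes from the error message can be reproduced from Python. This is a small sketch, assuming np.str_ as the spelling of the string alias; the names received and expected are only illustrative and mirror the two lines above.

```python
import numpy as np

x = np.array(["apple", "pear"], dtype=np.str_)

# What the function receives: 5 characters * 4 bytes per character.
received = x.itemsize

# What a buffer of Python objects would hold: one pointer per element
# (8 bytes on a 64-bit build).
expected = np.dtype(object).itemsize

print(received, expected, received == expected)
```

Since the item sizes differ, the buffer check rejects the array before any line of the function body runs.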

you cannot change the size of an element in an <U..-numpy-array:

Now let's take a look at the following:

>>> x = np.array(["apple", "pear"], dtype=np.str)
>>> x[0] = x[0]+str(0)
>>> x[0]
'apple'

The element didn't change, because the string x[0]+str(0) was truncated when written back to the x array: there is only room for 5 characters! It would work (to some degree, as long as the resulting string has no more than 5 characters) with "pear" though:

>>> x[1] = x[1]+str(1)
>>> x[1]
'pear1'

Where does this all leave you?

you probably want to use bytes and not unicode (i.e. dtype=np.bytes_)
given that you don't know the element size of your numpy array at compile time, you should declare the input array x as ar x in the signature and write the runtime checks yourself, similar to what is done in Cython's "deprecated" numpy tutorial.
if changes should be done in place, the elements of the input array must be big enough to hold the resulting strings.
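The last point can be illustrated in plain NumPy. This is only a sketch of making room before writing in place: the target widths U6 and S6 are assumptions chosen for this two-word example, not something taken from the question, and it does not involve prange at all.

```python
import numpy as np

x = np.array(["apple", "pear"], dtype=np.str_)

# Copy into a wider unicode dtype so that one extra character fits.
wide = x.astype("U6")
wide[0] = wide[0] + str(0)
print(repr(wide[0]))

# The same idea with a fixed-width bytes array: 6 bytes per element.
b = np.array([b"apple", b"pear"], dtype="S6")
b[1] = b[1] + b"1"
print(b[1], b.itemsize)
```

With bytes, each element is a plain run of itemsize chars, which is the kind of layout that can be manipulated from C-level code without touching Python objects.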

All of the above has nothing to do with prange. To use prange, you cannot use str(i), because it operates on Python objects.
