Besides the fact that your code should fail when cythonized, because you try to create a Python object (i.e. str(i)) without the GIL, your code isn't doing what you think it should do.
In order to analyse what is going on, let's take a look at a much simpler Cython version:
%%cython -2
cimport numpy as np
from numpy cimport ndarray as ar
cpdef func_string(ar[np.str, cast=True] x):
    print(len(x))
From your error message, one can deduce that you use Python 3 and that the Cython extension is built with the (still default) language_level=2, thus I'm using -2 in the %%cython magic cell.
And now:
>>> x = np.array(["apple", "pear"], dtype=np.str)
>>> func_string(x)
ValueError: Item size of buffer (20 bytes) does not match size of 'str object' (8 bytes)
What is going on?
x is not what you think it is
First, let's take a look at x:
>>> x.dtype
<U5
So x isn't a collection of unicode objects. An element of x consists of 5 unicode characters, and those elements are stored contiguously in memory, one after another. What is important: it is the same information as in unicode objects, but stored in a different memory layout.
This is one of numpy's quirks and how np.array works: every element in the list is converted to a unicode object, then the maximal element size is calculated, and a dtype that can hold it (in this case <U5) is chosen and used.
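The fixed-width layout described above can be inspected directly. This is a minimal sketch, assuming a recent NumPy, where the built-in str stands in for np.str (the np.str alias was removed in NumPy 1.24); the variable names are only illustrative.

```python
import numpy as np

# Built-in str is used here; the np.str alias was removed in NumPy 1.24.
x = np.array(["apple", "pear"], dtype=str)

print(x.dtype)     # fixed-width unicode dtype, sized for the longest element
print(x.itemsize)  # bytes per element: 5 characters * 4 bytes each
print(x.nbytes)    # whole buffer: 2 elements * itemsize

# Raw buffer of the shorter element: 4-byte code units, zero-padded to 5 characters.
raw_pear = x[1:2].tobytes()
print(raw_pear)
```

The shorter element "pear" still occupies the full 20 bytes; the unused last character slot is filled with zeros.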
np.str is interpreted differently in Cython code (ar[np.str] x) (twice!)
First difference: in your Python 3 code np.str stands for unicode, but in your Cython code, which is cythonized with language_level=2, np.str stands for bytes (see doc).
Second difference: seeing np.str, Cython interprets it as an array of Python objects (maybe this should be considered a Cython bug). It is almost the same as if dtype were np.object; actually, the only difference from np.object is slightly different error messages.
With this information we can understand the error message. At runtime, the input array is checked (before the first line of the function is executed!):
- expected is an array of Python objects, i.e. 8-byte pointers, i.e. an array with an element size of 8 bytes
- received is an array with an element size of 5*4=20 bytes (one unicode character is 4 bytes)
Thus the cast cannot be done and the observed exception is thrown.
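The size comparison behind the error can be reproduced in plain Python. This is only a sketch: check_itemsize is a hypothetical helper that imitates the idea of the check, not Cython's actual implementation, and the built-in str and object are used in place of the np.str and np.object aliases.

```python
import struct
import numpy as np

def check_itemsize(arr, expected_itemsize):
    """Hypothetical helper imitating the buffer check; not Cython's actual code."""
    if arr.itemsize != expected_itemsize:
        raise ValueError(
            f"Item size of buffer ({arr.itemsize} bytes) does not match "
            f"expected size ({expected_itemsize} bytes)"
        )

pointer_size = struct.calcsize("P")               # 8 on a typical 64-bit build
x = np.array(["apple", "pear"], dtype=str)        # fixed-width, 20 bytes per element
objs = np.array(["apple", "pear"], dtype=object)  # one pointer per element

check_itemsize(objs, pointer_size)  # passes: elements are pointers
try:
    check_itemsize(x, pointer_size)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

An object array passes because each element really is a pointer, while the fixed-width unicode array is rejected before any element is touched.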
You cannot change the size of an element in a <U.. numpy array
Now let's take a look at the following:
>>> x = np.array(["apple", b"pear"], dtype=np.str)
>>> x[0] = x[0]+str(0)
>>> x[0]
'apple'
The element didn't change, because the string x[0]+str(0) was truncated when written back to the x array: there is only room for 5 characters! It would work (to some degree, as long as the resulting string has no more than 5 characters) with "pear", though:
>>> x[1] = x[1]+str(1)
>>> x[1]
'pear1'
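The truncation, and one possible way around it, can be sketched as follows. Widening the array with astype is my own suggestion rather than something from the question; it copies the data into a new array instead of changing the original one.

```python
import numpy as np

x = np.array(["apple", "pear"], dtype=str)  # room for 5 characters per element

x[0] = x[0] + "0"       # 'apple0' has 6 characters and is cut back to 5
truncated = str(x[0])

# astype copies the data into a new array with room for 6 characters per element.
wide = x.astype("U6")
wide[0] = wide[0] + "0"
kept = str(wide[0])

print(truncated, kept)
```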
Where does this all leave you?
- you probably want to use bytes and not unicodes (i.e. dtype=np.bytes_)
- given that you don't know the element size of your numpy array at compile time, you should declare the input array x as ar x in the signature and write out the runtime checks yourself, similar to what is done in Cython's "deprecated" numpy tutorial.
- if changes should be done in-place, the elements in the input-array should be big enough for the resulting strings.
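The three points above can be sketched in plain Python before moving anything to Cython. The function append_index_inplace and its error messages are hypothetical, and the "S8" dtype is just an assumed width that leaves spare room in every element.

```python
import numpy as np

def append_index_inplace(x):
    """Hypothetical helper: append each element's index to it, in place."""
    # Runtime checks that a typed signature cannot express, since the element size varies.
    if not isinstance(x, np.ndarray) or x.dtype.kind != "S":
        raise TypeError("expected a numpy array of fixed-width bytes (dtype kind 'S')")
    if x.ndim != 1:
        raise ValueError("expected a one-dimensional array")
    for i in range(x.shape[0]):
        new_value = x[i] + str(i).encode("ascii")
        if len(new_value) > x.itemsize:
            raise ValueError(
                f"element {i} needs {len(new_value)} bytes, only {x.itemsize} available"
            )
        x[i] = new_value

# "S8" leaves spare room; dtype=np.bytes_ alone would size elements to the longest input.
fruits = np.array([b"apple", b"pear"], dtype="S8")
append_index_inplace(fruits)
print(fruits)
```

With dtype=np.bytes_ and no explicit width, the same call raises a ValueError for "apple" instead of silently truncating it.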
None of the above has anything to do with prange. To use prange, you cannot use str(i), because it operates on Python objects.
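One way to avoid str(i) is to write the decimal digits into the byte buffer yourself, using only integer arithmetic. The sketch below shows the idea in plain Python. write_index_digits is a hypothetical helper, the "S8" width is an assumption, and a nogil Cython version would still need typed variables and a typed view of the buffer. It also assumes non-negative indices and elements without embedded zero bytes.

```python
import numpy as np

def write_index_digits(buf, row, start, value):
    """Hypothetical helper: write value's decimal digits into buf[row, start:].

    Only integer arithmetic and element assignment are used.
    Returns the number of digits written, or -1 if they do not fit.
    """
    # Count the digits first (0 still needs one digit).
    n_digits, tmp = 1, value // 10
    while tmp > 0:
        n_digits += 1
        tmp //= 10
    if start + n_digits > buf.shape[1]:
        return -1
    # Fill from the least significant digit backwards.
    pos = start + n_digits - 1
    for _ in range(n_digits):
        buf[row, pos] = 48 + value % 10  # 48 is ASCII '0'
        value //= 10
        pos -= 1
    return n_digits

fruits = np.array([b"apple", b"pear"], dtype="S8")
# View the same memory as a 2-D array of single bytes, one row per element.
raw = fruits.view(np.uint8).reshape(fruits.shape[0], fruits.itemsize)
for i in range(raw.shape[0]):
    length = int(np.count_nonzero(raw[i]))  # used bytes (elements are zero-padded)
    write_index_digits(raw, i, length, i)
print(fruits)
```

Because raw is a view of the same memory, the writes show up in fruits without any Python string being created per element.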