I have a function that I'd like to use Cython with that involves processing large numbers of fixed-length strings. For a standard cython function, I can declare the types of arrays like so:

cpdef double[:] g(double[:] in_arr):
    cdef double[:] out_arr = np.zeros(in_arr.shape, dtype='float64')

    cdef Py_ssize_t i  # typed loop index; a bare "cdef i" declares a Python object
    for i in range(len(in_arr)):
        out_arr[i] = in_arr[i]

    return out_arr

This compiles and works as expected when the dtype is something simple like int32, float, double, etc. However, I cannot figure out how to create a typed memoryview of fixed-length strings - i.e. the equivalent of np.dtype('a5'), for example.

If I use this:

cpdef str[:] f(str[:] in_arr):
    # arr should be a numpy array of 5-character strings
    cdef str[:] out_arr = np.zeros(in_arr.shape, dtype='a5')

    cdef Py_ssize_t i  # typed loop index; a bare "cdef i" declares a Python object
    for i in range(len(in_arr)):
        out_arr[i] = in_arr[i]

    return out_arr

The function compiles, but this:

in_arr = np.array(['12345', '67890', '22343'], dtype='a5')
f(in_arr)

Throws the following error:

---> 16 cpdef str[:] f(str[:] in_arr):
     17     # arr should be a numpy array of 5-character strings
     18     cdef str[:] out_arr = np.zeros(in_arr.shape, dtype='a5')

ValueError: Buffer dtype mismatch, expected 'unicode object' but got a string

Similarly, if I use bytes[:], I get the error "Buffer dtype mismatch, expected 'bytes object' but got a string". And that doesn't even address the fact that nowhere am I specifying that these strings have length 5.

Interestingly, I can include fixed-length strings in a structured type as in this question, but I don't think that's the right way to declare the types.

Paul

1 Answer

In a Python3 session, your a5 array contains bytestrings.

In [165]: np.array(['12345', '67890', '22343'], dtype='a5')
Out[165]: 
array([b'12345', b'67890', b'22343'], 
      dtype='|S5')

http://cython.readthedocs.io/en/latest/src/tutorial/strings.html says that str is unicode string type when compiled with Python3.

I suspect that np.array(['12345', '67890', '22343'], dtype='U5') would be accepted as the input array for your function. But copying to the a5 out_arr would have problems.
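The difference between the two dtypes can be checked from plain NumPy before any Cython is involved. This is a minimal sketch; the variable names are illustrative, and it uses the `S5` spelling that NumPy itself reports for the `a5` dtype in the output above.

```python
import numpy as np

values = ['12345', '67890', '22343']

# Fixed-length bytestrings: 5 bytes per element, 1 byte per character.
byte_arr = np.array(values, dtype='S5')
# Fixed-length unicode strings: 5 code points per element, 4 bytes each.
uni_arr = np.array(values, dtype='U5')

print(byte_arr.dtype, byte_arr.dtype.itemsize)
print(uni_arr.dtype, uni_arr.dtype.itemsize)

# .item() converts a NumPy scalar to the corresponding Python object.
print(type(byte_arr[0].item()), type(uni_arr[0].item()))
```

Neither array has object dtype: the elements are stored inline in the buffer, which is why a memoryview declared over Python objects does not match either one.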

object version

An object version of this loop works:

cpdef str[:] objcopy(str[:] in_arr):
    cdef str[:] out_arr = np.zeros(in_arr.shape[0], dtype=object)
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr

narr = np.array(['one','two','three'], dtype=object)
cpy = objcopy(narr)
print(cpy)
print(np.array(cpy))
print(np.array(objcopy(np.array([None,'one', 23.4]))))

These functions return a memoryview, which has to be converted to array to print.

single char version

Single byte memoryview copy:

cpdef char[:] chrcopy(char[:] in_arr):
    cdef char[:] out_arr = np.zeros(in_arr.shape[0], dtype='uint8')
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr
print(np.array(chrcopy(np.array([b'one',b'two',b'three']).view('S1'))).view('S5'))

Uses view to convert strings to single bytes and back.
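The view round trip can be tried on its own in NumPy, without the Cython function in between. A small sketch with illustrative variable names:

```python
import numpy as np

narr = np.array([b'one', b'two', b'three'])  # dtype is S5 (longest item)

# Reinterpret the same 15 bytes as single characters or as raw uint8 codes.
chars = narr.view('S1')
codes = narr.view(np.uint8)
print(chars.shape, codes.shape)

# Shorter strings are padded with null bytes up to the fixed length.
print(codes[:5])

# Viewing back as S5 regroups the bytes into the original 3 strings.
restored = codes.view('S5')
print(restored)
```

No data is copied by `view`, so `restored` still shares memory with `narr`.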

2d unicode version

I looked into this issue last year: Cython: storing unicode in numpy array

This processes unicode strings as though they were rows of a 2d int array; reshape is needed before and after.

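cpdef int[:,:] int2dcopy(int[:,:] in_arr):
    # np.intc is NumPy's C int, so it matches the int[:,:] memoryview
    # (dtype=int is 64-bit on most 64-bit Linux/macOS builds and would not)
    cdef int[:,:] out_arr = np.zeros((in_arr.shape[0], in_arr.shape[1]), dtype=np.intc)
    cdef Py_ssize_t i, N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i,:] = in_arr[i,:]
    return out_arr

narr = np.array(['one','two','three', 'four', 'five'], dtype='U5')
# each U5 element holds 5 four-byte code points, i.e. one row of 5 C ints
cpy = int2dcopy(narr.view(np.intc).reshape(-1,5))
print(cpy)
print(np.array(cpy))
print(np.array(cpy).view(narr.dtype)) # .reshape(-1)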

For bytestrings a similar 2d char version should work.
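The NumPy side of such a 2d char version might look like the sketch below: each `S5` element becomes one row of 5 `uint8` values, which is the shape a `char[:,:]` or `unsigned char[:,:]` memoryview would expect. The variable names are illustrative, and the Cython function itself is left out.

```python
import numpy as np

bnarr = np.array(['one', 'two', 'three', 'four', 'five'], dtype='S5')

# One row per string, one column per byte.
rows = bnarr.view(np.uint8).reshape(-1, 5)
print(rows.shape)
print(rows[2])  # byte codes of b'three'

# Going back: view each 5-byte row as one S5 element, then flatten.
restored = rows.view('S5').reshape(-1)
print(restored)
```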

c struct version

byte5 = cython.struct(x=cython.char[5])
cpdef byte5[:] byte5copy(byte5[:] in_arr):
    cdef byte5[:] out_arr = np.zeros(in_arr.shape[0], dtype='|S5')
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr

narr = np.array(['one','four','six'], dtype='|S5')
cpy = byte5copy(narr)
print(cpy)
print(repr(np.array(cpy)))
# array([b'one', b'four', b'six'], dtype='|S5')

The C struct is creating a memoryview with 5 byte elements, which map onto array S5 elements.
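On the NumPy side, the same layout can be expressed as a structured dtype with a single 5-byte field. This is only a sketch of the correspondence, not part of the Cython code above; the name `byte5` is reused here for illustration.

```python
import numpy as np

# A record with one fixed-length bytestring field: 5 bytes per element,
# the same size as the C struct with a char[5] member.
byte5 = np.dtype([('x', 'S5')])
print(byte5.itemsize)

narr = np.array(['one', 'four', 'six'], dtype='S5')

# Same bytes, reinterpreted as records; the field gives the strings back.
records = narr.view(byte5)
print(records['x'])
```

Because the item sizes agree, the plain `S5` array and the structured array are two views of the same memory.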

https://github.com/cython/cython/blob/master/tests/memoryview/numpy_memoryview.pyx also has a structured array example with bytestrings.

hpaulj
  • This doesn't account for the fact that `bytes[:]` does not work. – Paul Mar 01 '17 at 23:55
  • Last year I thought of handling unicode strings as rows of a 2d array, `U5` being a 5 column `int` row. – hpaulj Mar 02 '17 at 04:06
  • Haven't had time to look at your edits yet, but if you're still working on this, there's no need to focus on the unicode part. I'm fine with using bytestrings if necessary, they are fixed length ASCII identifiers. – Paul Mar 02 '17 at 04:59
  • My best try so far works the same for unicode and bytes, just using `int` vs. `uint8` (4 bytes vs. 1 per 'element'). – hpaulj Mar 02 '17 at 05:34
  • Yeah, looking at this, it seems like the `object` version is the best of these, but even abandoning the idea of using `memoryview` objects, why is there not a version of this that can do something similar to `cdef np.ndarray[np.dtype('a6')] arr = ...`? Treating it as an array of objects rather than an array of fixed-length strings seems like it is throwing away pretty valuable type and memory-layout information. – Paul Mar 02 '17 at 17:48