
Consider a C buffer of N elements created with:

from ctypes import byref, c_double

import numpy as np

N = 3
buffer = (c_double * N)()
# C++ function that fills the buffer byref
pull_function(byref(buffer))
# load buffer in numpy
data = np.frombuffer(buffer, dtype=c_double)

Works great. But my issue is that the dtype may be numerical (float, double, int8, ...) or string.

from ctypes import byref, c_char_p

N = 3
buffer = (c_char_p * N)()
# C++ function that fills the buffer byref
pull_function(byref(buffer))
# Load in.. a list?
data = [v.decode("utf-8") for v in buffer]

How can I load those UTF-8 encoded strings directly into a numpy array? np.char.decode seems to be a good candidate, but I can't figure out how to use it. np.char.decode(np.frombuffer(buffer, dtype=np.bytes_)) fails with ValueError: itemsize cannot be zero in type.

EDIT: The buffer can be filled from the Python API. The corresponding lines are:

x = [list of strings]
x = [v.encode("utf-8") for v in x]
buffer = (c_char_p * N)(*x)
push_function(byref(buffer))

Note that this is a different buffer from the one above. push_function pushes the data in x on the network while pull_function retrieves the data from the network. Both are part of the LabStreamingLayer C++ library.

Edit 2: I suspect I can get this to work if I can reload the 'push' buffer into a numpy array before sending it to the network. The 'pull' buffer is probably the same. In that sense, here is a MWE demonstrating the ValueError described above.

from ctypes import c_char_p

import numpy as np


x = ["1", "23"]
x = [elt.encode("utf-8") for elt in x]
buffer = (c_char_p * 2)(*x)
np.frombuffer(buffer, dtype=np.bytes_)  # fails
[elt.decode("utf-8") for elt in buffer]  # works
Mathieu
  • Can you modify the C function? Is there a maximum length to each string? – Hack5 Dec 12 '22 at 16:51
  • @Hack5 Hello, I don't have control on the C functions. It does not seem to have a maximum length. I tried with strings of 100k elements, and the buffer filling/loading worked. The Python API can fill the buffer by calling another C function. I added the corresponding lines in the post. – Mathieu Dec 12 '22 at 19:08
  • I think that you'll either need to parse the strings in pure python or in C. Numpy seems to expect fixed-maximum-length strings in a buffer of length (string_length * string_count). Once you've got them in that form, you can easily import them into np via np.dtype("a" + str(total_length)) – Hack5 Dec 13 '22 at 09:25
  • @Hack5 Yet numpy supports arrays with strings of variable size, right? `np.array(["1", "123"])` has a dtype of `<U3`. – Mathieu Dec 13 '22 at 10:01
  • >>> a = np.array(["1", "123"]) >>> a.tobytes() b'1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00\x00\x002\x00\x00\x003\x00\x00\x00' It functions exactly the same as the "a3" dtype but with a utf-32 encoding. That is, it can store UP TO 3 characters, and they are stored "flat". – Hack5 Dec 13 '22 at 10:19
  • The error message suggests that `v` is the empty string (`''`) or maybe a control character like `EOF` in some instances. Are you sure the buffer is full of non-zero-length data? – Bill Horvath Dec 14 '22 at 20:31
  • @BillHorvath I don't know how to debug that so I can't answer your question, but please find a second edit where I tried to reload the buffer pushed to the network (and hypothetically, exactly the same as the buffer I try to pull from the network). At least this way, the problem is reproducible. – Mathieu Dec 14 '22 at 22:08

2 Answers


You can convert a byte buffer to a Python string using string_at from ctypes. Using buffer.decode("utf-8") also works, as you saw (on one c_char_p, not an array of them).
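For example, a minimal sketch of the string_at route (the array is filled by hand here instead of by pull_function, which comes from the C++ library):

from ctypes import POINTER, c_char_p, c_void_p, cast, string_at

# hypothetical stand-in for a buffer filled by pull_function
buffer = (c_char_p * 2)(b"1", b"23")

# view the same array as raw addresses so that indexing does not
# auto-convert each element to a Python bytes object
addresses = cast(buffer, POINTER(c_void_p))
as_bytes = [string_at(addresses[i]) for i in range(len(buffer))]  # [b'1', b'23']
as_str = [b.decode("utf-8") for b in as_bytes]                    # ['1', '23']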

c_char_p * N is an array of pointers to characters (basically an array of C strings, with the C type char*[3]). The point is that Numpy stores strings in a flat buffer, so a copy is nearly mandatory. All the strings of a Numpy array have a bounded size, and the reserved size of the overall array is arr.size * maxStrSize * bytePerChar, where maxStrSize is the size of the biggest string of the array (unless manually changed/specified) and bytePerChar is 1 for Numpy byte string arrays (ie. S) and typically 4 for Numpy unicode string arrays (ie. U). Indeed, Numpy should use the UCS-4 encoding for unicode strings (AFAIK, unicode strings could also be represented in memory as UCS-2 depending on how the Python interpreter was compiled, but one can check whether the UCS-4 encoding is used by checking that np.dtype('U1').itemsize == 4 is actually true).

The only way to avoid a copy is if your C++ code can write directly into a preallocated Numpy array. This means the C++ code must use the same representation as Numpy arrays, and the bounded size of all the strings must be known before calling the C++ function.
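A small illustration of that size rule (nothing here is specific to the question's buffers):

import numpy as np

arr = np.array(["1", "123"])       # Numpy sizes the dtype for the longest string
print(arr.dtype)                   # <U3
print(np.dtype('U1').itemsize)     # 4 on a UCS-4 build
print(arr.nbytes)                  # 2 elements * 3 chars * 4 bytes = 24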

np.frombuffer interprets a buffer as a 1-dimensional array. Thus the buffer needs to be flat, while your buffer is an array of pointers, so np.frombuffer cannot be used directly in this case.
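For comparison, a flat, NUL-padded byte buffer with an explicit item size works fine (a toy buffer, not the question's pointer array):

import numpy as np

flat = b"1\x00\x0023\x00"                 # two 3-byte slots, NUL-padded
print(np.frombuffer(flat, dtype="S3"))    # [b'1' b'23']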

A quite inefficient solution is simply to convert the strings to CPython bytes objects and then build a Numpy array from all of them, so Numpy will find the biggest string, allocate the big buffer and copy each string. This is trivial to implement: np.array([elt.decode("utf-8") for elt in buffer]). This is not very efficient since CPython does the conversion of each string and allocates intermediate strings that are then read by Numpy before being deallocated.

A faster solution is to copy each string into a raw buffer and then use np.frombuffer (a sketch of this is shown below). But this is not so simple in practice: one needs to check the size of the strings using strlen (or to know the bounded size, if any), then allocate a big buffer, then use a memcpy loop (one should not forget to write the final 0 character after that if a string is smaller than the maximum size), and finally use np.frombuffer (specifying dtype='S%d' % maxLen). This can certainly be done in Cython or with a C extension for the sake of performance. A better alternative is to preallocate a Numpy array and write directly into its raw buffer.

There is a problem though: this only works for ASCII/byte string arrays (ie. S), not for unicode ones (ie. U). For unicode strings, the strings need to be decoded from the UTF-8 encoding and then encoded back to a UCS-2/UCS-4 byte buffer. np.frombuffer cannot be used in this case because of the zero-sized dtype, as pointed out by @BillHorvath. Thus, one needs to do that more manually since AFAIK there is no way to do it efficiently using only CPython or Numpy. The best is certainly to do it in C using fast specialized libraries. Note that unicode strings tend to be inherently inefficient (because of the variable size of each character), so please consider using byte strings if the target strings are guaranteed to be ASCII ones.
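A pure-Python sketch of that copy (a C/Cython version would do the same with strlen/memcpy; buffer stands in for a filled (c_char_p * N)() array like in the question, built by hand here):

import numpy as np
from ctypes import POINTER, c_char_p, c_void_p, cast, string_at

buffer = (c_char_p * 2)(b"1", b"23")      # stand-in for the pulled buffer

addresses = cast(buffer, POINTER(c_void_p))
raw = [string_at(addresses[i]) for i in range(len(buffer))]
max_len = max(len(s) for s in raw)

flat = bytearray(len(raw) * max_len)      # zero-filled, so padding is already 0
for i, s in enumerate(raw):
    flat[i * max_len : i * max_len + len(s)] = s

arr = np.frombuffer(bytes(flat), dtype="S%d" % max_len)
print(arr)                                # [b'1' b'23']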

Jérôme Richard
    Thank you for the great explanation. At least for now, I will keep the conversion with a list comprehension, which is simpler and faster than `np.array([elt.decode("utf-8") for elt in buffer])`. – Mathieu Dec 17 '22 at 11:00

It looks like the error message you're seeing is because bytes_ is a flexible data type, whose itemsize is 0 by default:

The 24 built-in array scalar type objects all convert to an associated data-type object. This is true for their sub-classes as well...Note that not all data-type information can be supplied with a type-object: for example, flexible data-types have a default itemsize of 0, and require an explicitly given size to be useful.

And reconstructing an array from a buffer using a dtype whose size is 0 by default fails by design.
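For instance (with a toy flat buffer, not the question's pointer array):

import numpy as np

print(np.dtype(np.bytes_).itemsize)       # 0 -> "itemsize cannot be zero in type"
print(np.dtype("S4").itemsize)            # 4 -> an explicit size works with frombuffer

flat = b"1\x00\x00\x0023\x00\x00"
print(np.frombuffer(flat, dtype="S4"))    # [b'1' b'23']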

If you know in advance the type and length of the data you'll see in the buffer, then this answer might have the solution you're looking for:

One way to customize the dtype is to assign names in genfromtxt, and recast the values after with astype.

Bill Horvath