I have a long unicode string:

import random
import re

alphabet = range(0x0FFF)
mystr = ''.join(chr(random.choice(alphabet)) for _ in range(100))
mystr = re.sub(r'\W', '', mystr)  # keep only word characters

I would like to view it as a series of code points, so at the moment, I am doing the following:

import numpy as np
arr = np.array(list(mystr), dtype='U1')

I would like to manipulate the string as numbers and eventually get some different code points back, so I also need to invert the transformation:

mystr = ''.join(arr.tolist())

These transformations are reasonably fast and invertible, but take up an unnecessary amount of space with the list intermediary.

Is there a way to convert a numpy array of unicode characters to and from a Python string without converting to a list first?

Afterthoughts

I can get arr to appear as a single string with something like

buf = arr.view(dtype='U' + str(arr.size))

This results in a 1-element array containing the entire original. The inverse is possible as well:

buf.view(dtype='U1')

The only issue is that the type of the result is np.str_, not str.
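
As a small illustration of the view trick above (a sketch on a fixed sample string, not part of the original experiment), the following checks that the view reuses the array's memory instead of copying it, and that the element it returns still passes an isinstance check against str:

```python
import numpy as np

arr = np.array(list("numpy"), dtype="U1")
buf = arr.view(dtype=f"U{arr.size}")  # 1-element array holding the whole string

# The view is a reinterpretation of the same buffer, so no copy is made.
print(buf.shape, np.shares_memory(arr, buf))  # (1,) True

# buf[0] is a NumPy string scalar, which is a subclass of str.
element = buf[0]
print(isinstance(element, str))  # True
```

Note that the view requires a contiguous array, and the sample string here is assumed to be non-empty.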

Mad Physicist

2 Answers

fromiter works, but is really slow, since it goes through the iterator protocol. It's much faster to encode your data to UTF-32 (in system byte order) and use numpy.frombuffer:

In [56]: x = ''.join(chr(random.randrange(0x0fff)) for i in range(1000))

In [57]: codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

In [58]: %timeit numpy.frombuffer(bytearray(x, codec), dtype='U1')
2.79 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [59]: %timeit numpy.fromiter(x, dtype='U1', count=len(x))
122 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [60]: numpy.array_equal(numpy.fromiter(x, dtype='U1', count=len(x)), numpy.frombuffer(bytearray(x, codec), dtype='U1'))
Out[60]: True

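I've used sys.byteorder to determine whether to encode in utf-32-le or utf-32-be. Naming the byte order explicitly also keeps the encoder from prepending a byte order mark, which the plain utf-32 codec would add. Also, using bytearray(x, codec) instead of x.encode(codec) produces a mutable bytearray instead of an immutable bytes object, so the resulting array is writable.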
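
To make the writability point concrete, here is a minimal sketch on a made-up sample string (the variable names are illustrative only). It builds the array once from bytes and once from a bytearray, then compares the writeable flag:

```python
import sys
import numpy as np

text = "héllo wörld"
# Native-order UTF-32 matches the memory layout of a native 'U1' array.
codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"

read_only = np.frombuffer(text.encode(codec), dtype="U1")     # backed by immutable bytes
writable = np.frombuffer(bytearray(text, codec), dtype="U1")  # backed by a mutable bytearray

print(read_only.flags.writeable)  # False
print(writable.flags.writeable)   # True

# In-place edits only work on the bytearray-backed array.
writable[0] = "H"
result = writable.view(dtype=f"U{writable.size}").item()
print(result)  # Héllo wörld
```

If a read-only array is acceptable, text.encode(codec) avoids the need for a mutable buffer.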


As for the reverse conversion, arr.view(dtype=f'U{arr.size}')[0] works, but using item() is a bit faster and produces an ordinary string object, avoiding possible weird edge cases where numpy.str_ doesn't quite behave like str:

In [72]: a = numpy.frombuffer(bytearray(x, codec), dtype='U1')

In [73]: type(a.view(dtype=f'U{a.size}')[0])
Out[73]: numpy.str_

In [74]: type(a.view(dtype=f'U{a.size}').item())
Out[74]: str

In [75]: %timeit a.view(dtype=f'U{a.size}')[0]
3.63 µs ± 34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [76]: %timeit a.view(dtype=f'U{a.size}').item()
2.14 µs ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Finally, be aware that NumPy doesn't handle nulls like normal Python string objects do. NumPy drops trailing nulls when it reads a string back out of an array, so it can't distinguish between 'asdf\x00\x00\x00' and 'asdf'. Using NumPy arrays for string operations is therefore not safe if your data may contain null code points.
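
The null caveat can be seen in a few lines. This is a sketch with made-up sample strings; it shows trailing nulls being lost on the way back out, while a null in the middle of the string survives:

```python
import numpy as np

padded = "asdf\x00\x00\x00"
stored = np.array([padded])

# Reading the element back drops the trailing nulls.
recovered = stored.item()
print(len(padded), len(recovered))  # 7 4

# A null that is followed by other characters is preserved.
embedded = "as\x00df"
embedded_back = np.array([embedded]).item()
print(embedded_back == embedded)  # True
```

The same loss applies to a round trip through a 'U1' array, since the final view is read back in the same way.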

user2357112
  • I am fairly sure that `U1` stands for system byte-order. In my case it translates to ` – Mad Physicist Jan 29 '19 at 21:21
  • @MadPhysicist: I was under the impression that the `>` or `<` in a `U` dtype controlled order of code points relative to each other, but apparently it controls byte order within a code point. – user2357112 Jan 29 '19 at 21:27
  • I didn't even know `item` existed. Definitely selecting this answer over mine. – Mad Physicist Jan 29 '19 at 21:40
  • I've added `arr = np.array([mystr]).view(dtype='U1')` to my answer. It seems to be much faster than any of the other options. And I did check the units this time. It's conceptually a better inverse to `.item()` in my mind, which is nice. – Mad Physicist Jan 29 '19 at 22:01
  • @MadPhysicist: Oh, right, that's an option. I got distracted by complexity and totally forgot you can just pass the string to NumPy directly. – user2357112 Jan 29 '19 at 22:03
  • I never considered it either. Just stumbled onto it by accident :) – Mad Physicist Jan 29 '19 at 22:04
  • @MadPhysicist: Actually, it seems to be slower than `frombuffer` in my tests. It's fast enough that I'd still pick it over `frombuffer` for simplicity and being less bug-prone, but it's not actually beating `frombuffer` in speed. I wonder what's causing the difference in our observations. – user2357112 Jan 29 '19 at 22:08
  • I think that the advantage goes away entirely for larger strings. I tested with 100, 10K and 1M, and frombuffer only wins in the first case. I put the timings in my answer. – Mad Physicist Jan 29 '19 at 22:18
  • @MadPhysicist: My timings agree with yours. It looks like a size-dependent effect. – user2357112 Jan 29 '19 at 22:21

The fastest way I have found to convert a string to an array is

arr = np.array([mystr]).view(dtype='U1')

Another (slower) way to convert a string to an array of unicode code points based on @Daniel Mesejo's comment:

arr = np.fromiter(mystr, dtype='U1', count=len(mystr))

Looking at the source code for fromiter shows that setting the count parameter to the length of the string will cause the entire array to be allocated at once, instead of performing multiple reallocations.

To convert back to a string:

str(arr.view(dtype=f'U{arr.size}')[0])

For most purposes, the final conversion to Python str is not necessary, since np.str_ is a subclass of str:

arr.view(dtype=f'U{arr.size}')[0]
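
Putting the two directions together, a round trip that treats the characters as numbers might look like the sketch below. The helper names str_to_array and array_to_str are hypothetical, and the uint32 view assumes the array uses the native byte order (as it does when created this way):

```python
import numpy as np

def str_to_array(s):
    # Hypothetical helper: one 'U1' element per code point (s is assumed non-empty).
    return np.array([s]).view("U1")

def array_to_str(arr):
    # Hypothetical helper: needs a contiguous 1-D 'U1' array; returns a plain str.
    return arr.view(f"U{arr.size}").item()

arr = str_to_array("abc")
codes = arr.view(np.uint32)  # same memory, seen as code point numbers
shifted = codes + 1          # new uint32 array with each code point incremented
result = array_to_str(shifted.view("U1"))
print(codes.tolist(), result)  # [97, 98, 99] bcd
```

Here item() is used instead of indexing with [0] so that the result is a plain str rather than np.str_.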

Appendix: Timing of frombuffer vs array

100 characters

mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(100))

%timeit np.array([mystr]).view(dtype='U1')
1.43 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
1.2 µs ± 9.06 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

10,000 characters

mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(10000))

%timeit np.array([mystr]).view(dtype='U1')
4.33 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
10.9 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

1,000,000 characters

mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(1000000))

%timeit np.array([mystr]).view(dtype='U1')
672 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
732 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
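
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The function names via_array and via_frombuffer are made up for this example, the codec is chosen from sys.byteorder rather than hard-coded, and the numbers it prints will differ by machine and NumPy version:

```python
import random
import sys
import timeit

import numpy as np

codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"

def via_array(s):
    # Let NumPy store the string itself, then view it one character at a time.
    return np.array([s]).view(dtype="U1")

def via_frombuffer(s):
    # Encode to native-order UTF-32 and reinterpret the bytes.
    return np.frombuffer(bytearray(s, codec), dtype="U1")

for n in (100, 10_000):
    s = "".join(chr(random.randrange(1, 0x1000)) for _ in range(n))
    # Both approaches should produce the same array before timing them.
    assert np.array_equal(via_array(s), via_frombuffer(s))
    for func in (via_array, via_frombuffer):
        seconds = timeit.timeit(lambda: func(s), number=1000) / 1000
        print(f"n={n:>6} {func.__name__:<15} {seconds * 1e6:8.2f} µs per call")
```

The loop stops at 10,000 characters to keep the run short; adding 1_000_000 to the tuple covers the largest case above.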
Mad Physicist