You can split the strings into single characters with the array's view method:
In [18]: strarray = np.array([[b"123456"], [b"654321"]])
In [19]: strarray.dtype
Out[19]: dtype('S6')
In [20]: strarray.view('S1')
Out[20]:
array([['1', '2', '3', '4', '5', '6'],
['6', '5', '4', '3', '2', '1']],
dtype='|S1')
See the NumPy data type documentation for the character codes ('S1' is a one-byte string).
Then the most obvious next step is to use astype:
In [23]: strarray.view('S1').astype(int)
Out[23]:
array([[1, 2, 3, 4, 5, 6],
[6, 5, 4, 3, 2, 1]])
However, it's a lot faster to reinterpret (view) the memory underlying the strings as single-byte integers and subtract 48. This works because each ASCII character takes up a single byte, and the characters '0' through '9' have the same binary representation as the (u)int8 values 48 through 57 (check with the ord builtin).
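As a minimal sketch of that trick, applied to the same small array as in the examples above (the names codes and digits are only illustrative):

```python
import numpy as np

strarray = np.array([[b"123456"], [b"654321"]])  # dtype 'S6', shape (2, 1)

# Reinterpret every byte of the strings as an unsigned 8-bit integer.
# No data is copied; the last axis grows from 1 item to 6 items.
codes = strarray.view(np.uint8)

# ASCII '0' is 48, so subtracting it maps '0'..'9' to 0..9.
digits = codes - ord("0")
print(digits)
```

Note that this assumes every byte is an ASCII digit; any other character silently produces a meaningless number instead of raising an error, as astype(int) would.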
Speed comparison:
In [26]: ar = np.array([[''.join(np.random.choice(list('123456789'), size=320))] for _ in range(1000)], bytes)
In [27]: %timeit _ = ar.view('S1').astype(np.uint8)
1 loops, best of 3: 284 ms per loop
In [28]: %timeit _ = ar.view(np.uint8) - ord('0')
1000 loops, best of 3: 1.07 ms per loop
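If you have Unicode strings instead of ASCII bytes, you need to do these steps slightly differently, because NumPy stores each Unicode character in four bytes rather than one. Alternatively, just convert to ASCII first with astype(bytes).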
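A rough sketch of both options for a Unicode array, assuming it contains only ASCII digits and has the native byte order (the variable names are made up for illustration):

```python
import numpy as np

ustrarray = np.array([["123456"], ["654321"]])  # Unicode dtype 'U6'

# Option 1: convert to bytes first, then reuse the single-byte view.
digits_a = ustrarray.astype(bytes).view(np.uint8) - ord("0")

# Option 2: NumPy stores each Unicode character as a 4-byte code point,
# so view the memory as 32-bit integers instead of 8-bit ones.
digits_b = ustrarray.view(np.uint32) - ord("0")

print(digits_a)
print(digits_b)
```

Option 1 makes a converted copy of the data, while option 2 works directly on the existing memory.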