String parsing is often slow. Unicode decoding often makes things slower (especially when there are non-ASCII characters) unless it is carefully optimized, which is hard. CPython is slow, especially for loops. NumPy is not really designed to deal with strings efficiently. I do not think NumPy can do this faster than fromstring yet. The only solutions I can come up with are Numba, Cython, or even basic C extensions. The simplest solution is to use Numba; the fastest is to use Cython or C extensions.
Unfortunately, Numba is currently very slow with strings and bytes (this is an open issue that is not planned to be solved any time soon). Some tricks are needed so that Numba can compute this efficiently: the string needs to be converted to a NumPy array. This means it must first be encoded to a byte array to avoid any variable-sized encoding (like UTF-8). np.frombuffer seems to be the fastest way to convert the buffer to a NumPy array. Since the input is a read-only array (unusual, but efficient), the Numba signature is not very easy to read.
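As a small illustration of the conversion described above (a sketch using a made-up sample string, not part of the benchmark): encoding to ASCII yields exactly one byte per character, and np.frombuffer wraps those bytes without copying them, which is why the resulting array is read-only.

```python
import numpy as np

text = "12;-7;300"               # hypothetical sample input
data = text.encode("ascii")      # fixed-width encoding: one byte per character
arr = np.frombuffer(data, dtype=np.uint8)  # view on the bytes object, no copy

print(arr.dtype, arr.shape)      # element type and length of the byte array
print(arr.flags.writeable)       # False, because bytes objects are immutable
print(arr.flags.c_contiguous)    # True, matching the 'C' layout in the signature
```

The read-only flag is the reason the signature has to spell out nb.types.Array(..., readonly=True) instead of the shorter nb.uint8[::1].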
Here is the final solution:
import numpy as np
import numba as nb

# The input is a read-only, C-contiguous 1D array of bytes (as produced by np.frombuffer).
@nb.njit(nb.int32[::1](nb.types.Array(nb.uint8, 1, 'C', readonly=True)))
def compute(arr):
    sep = ord(';')
    base = ord('0')
    minus = ord('-')

    # First pass: count the separators to know the output size.
    count = 1
    for c in arr:
        count += c == sep
    res = np.empty(count, np.int32)

    # Second pass: accumulate digits and store a value at each separator.
    val = 0
    positive = True
    cur = 0
    for c in arr:
        if c != sep and c != minus:
            val = (val * 10) + c - base
        elif c == minus:
            positive = False
        else:
            res[cur] = val if positive else -val
            cur += 1
            val = 0
            positive = True

    # Store the last value (there is no trailing separator).
    if cur < count:
        res[cur] = val if positive else -val
    return res

x = ';'.join(np.random.randint(0, 200, 10_000_000).astype(str))
result = compute(np.frombuffer(x.encode('ascii'), np.uint8))
Note that the Numba solution performs no checks, for the sake of performance. It assumes the input contains only ASCII digits, ';' separators, and an optional leading '-' for each number, and that every value fits in an int32. Thus, you must ensure the input is valid. Alternatively, you can add checks to it (at the expense of slower code).
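If the input cannot be trusted, one option is to validate the encoded bytes before calling the Numba function. The helper below is only a sketch: is_valid_input is a hypothetical name, it uses the standard library only, it has not been benchmarked, and it does not check that values fit in an int32.

```python
import re

# Hypothetical helper, not part of the original answer: accepts only
# ';'-separated decimal integers, each with an optional leading '-'.
_VALID = re.compile(rb"-?[0-9]+(?:;-?[0-9]+)*")

def is_valid_input(data: bytes) -> bool:
    """Return True if `data` only contains well-formed ';'-separated integers."""
    # Cheap pre-filter: delete every allowed byte; anything left is invalid.
    if data.translate(None, b"0123456789;-"):
        return False
    # Structural check: rejects cases such as '1;;2', '1-2' or a trailing ';'.
    return _VALID.fullmatch(data) is not None

print(is_valid_input(b"12;-7;300"))  # well-formed input
print(is_valid_input(b"1;;2"))       # empty field
```

Note that this check is stricter than the parsing loop itself, which silently turns an empty field into 0.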
Here are performance results on my machine with an i5-9600KF processor (with NumPy 1.22.4 on Windows):
np.fromstring(x, dtype=np.int32, sep=';'): 8927 ms
np.array(re.split(";", x), dtype=np.int32): 1419 ms
np.array(x.split(";"), dtype=np.int32): 1196 ms
Numba implementation: 78 ms
Numba implementation (without negative numbers): 67 ms
This solution is 114 times faster than np.fromstring and 15 times faster than the fastest alternative (based on split). Note that removing the support for negative numbers makes the Numba function about 16% faster (67 ms instead of 78 ms). Also note that 10~12% of the time is spent in encode. The rest of the time comes from the main loop in the Numba function. More specifically, the conditionals in the loop are the main source of the slowdown because they can hardly be predicted by the processor and they prevent the use of fast (SIMD) instructions. This is often why string parsing is slow.
A possible improvement is to use a branchless implementation operating on chunks. Another possible improvement is to compute the chunks using multiple threads. However, both optimizations are tricky to do and they both make the resulting code significantly harder to read (and so to maintain).
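To make the two ideas above a bit more concrete, here is a rough pure-Python sketch. All function names are hypothetical. It only shows the logic: cutting the buffer at separator boundaries so chunks can be parsed independently, and replacing the if/elif/else with arithmetic on 0/1 flags. It is not SIMD code, it is slow in CPython, and whether porting it into the Numba function actually helps has not been measured here.

```python
from concurrent.futures import ThreadPoolExecutor

def split_points(data: bytes, n_chunks: int) -> list:
    """Return (start, end) byte ranges that only break at ';' separators."""
    bounds = [0]
    for i in range(1, n_chunks):
        guess = max(len(data) * i // n_chunks, bounds[-1])
        pos = data.find(b";", guess)    # next separator at or after the guess
        if pos == -1:
            break
        bounds.append(pos + 1)          # the next chunk starts after the ';'
    bounds.append(len(data) + 1)        # pretend there is a final ';'
    # Each range excludes its trailing separator.
    return [(s, e - 1) for s, e in zip(bounds, bounds[1:])]

def parse_chunk_branchless(chunk: bytes) -> list:
    """Parse one chunk using arithmetic on 0/1 flags instead of if/elif/else."""
    sep, minus, base = ord(";"), ord("-"), ord("0")
    res = [0] * (chunk.count(b";") + 1)
    val, sign, cur = 0, 1, 0
    for c in chunk:
        s = int(c == sep)                # 1 on a separator
        m = int(c == minus)              # 1 on a minus sign
        d = 1 - s - m                    # 1 otherwise (assumed to be a digit)
        val = d * (val * 10 + c - base)  # accumulate digits, reset otherwise
        sign = s - m + d * sign          # ';' -> +1, '-' -> -1, digit -> unchanged
        cur += s                         # move to the next slot on ';'
        res[cur] = sign * val            # always store the value parsed so far
    return res

def parse_parallel(data: bytes, n_chunks: int = 4) -> list:
    """Parse each chunk in a worker thread and concatenate the results."""
    ranges = split_points(data, n_chunks)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: parse_chunk_branchless(data[r[0]:r[1]]), ranges)
        return [v for part in parts for v in part]

print(parse_parallel(b"1;22;-3;44;0;-150", n_chunks=3))
```

With plain Python threads the GIL prevents any real speed-up; in a Numba version the per-chunk function would need to release the GIL (for example with nogil=True) or the chunk loop would need nb.prange, and the output would be written into preallocated slices rather than concatenated lists.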