3

I have a very long string in Python:

x = "12;14;14;14;18;12;17;19" # only a small part is shown: the real string contains 10 million ';'

The goal is to transform it into:

y = array([12, 14, 14, 14, 18, 12, 17, 19], dtype=int)

One way to do this is to use `array(x.split(";"))` or `numpy.fromstring`.

But both are extremely slow.
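
For reference, here is what these two approaches look like (a minimal sketch on a short stand-in string):

import numpy as np

x = "12;14;14;14;18;12;17;19"  # stand-in for the real 10-million-entry string

y1 = np.array(x.split(";"), dtype=np.int32)     # split-based approach
y2 = np.fromstring(x, dtype=np.int32, sep=";")  # Numpy text parser
assert (y1 == y2).all()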

Is there a quicker way to do it in Python?

Thank you very much and have a nice day.

Konrad Rudolph
AdrienC
  • 2
    Does this answer your question? [Most efficient way to split strings in Python](https://stackoverflow.com/questions/9602856/most-efficient-way-to-split-strings-in-python) – Cow Dec 21 '22 at 08:52
  • Try `np.fromstring(x, dtype=np.int, sep=';')` – Olvin Roght Dec 21 '22 at 08:52
  • Hello, thank you for the answer. Unfortunately it performs the same way :( – AdrienC Dec 21 '22 at 08:54
  • Have you tried textwrap? – Rob Dec 21 '22 at 08:58
  • You could maybe give it a try with `numba` and write the results into a pre-allocated Numpy array. – Mark Setchell Dec 21 '22 at 10:38
  • What is “extremely slow”? Is there a reason to expect that a faster way exists? I don’t see a particular reason why these operations should perform slower than necessary: they are probably reasonably optimised. – Konrad Rudolph Dec 21 '22 at 11:01
  • 3
    Strings are slow, especially unicode ones. CPython is slow, especially loops. Numpy is not really designed to (efficiently) deal with strings. Unfortunately, Numba is very slow for strings/bytes so far (an open issue that is not planned to be solved any time soon). The only possible solutions I see are cheats/tricks, or using C/Cython. I do not think Numpy can do this faster than `fromstring`. Numba certainly could, with some tricks. Are the numbers always made of 2 digits? Are they always small? – Jérôme Richard Dec 21 '22 at 11:16

1 Answer

6

String parsing is often slow. Unicode decoding often makes things slower (especially when there are non-ASCII characters) unless it is carefully optimized (which is hard). CPython is slow, especially its loops. Numpy is not really designed to (efficiently) deal with strings. I do not think Numpy can do this faster than fromstring yet. The only solutions I can come up with use Numba, Cython or even basic C extensions. The simplest solution is to use Numba; the fastest is to use Cython/C extensions.

Unfortunately, Numba is very slow for strings/bytes so far (this is an open issue that is not planned to be solved any time soon). Some tricks are needed so that Numba can compute this efficiently: the string needs to be converted to a Numpy array. This means it must first be encoded to a byte array, to avoid any variable-sized encoding (like UTF-8). np.frombuffer seems the fastest way to convert the buffer to a Numpy array. Since the input is a read-only array (unusual, but efficient), the Numba signature is not very easy to read.

Here is the final solution:

import numpy as np
import numba as nb

# The input is a read-only C-contiguous uint8 array, hence the explicit signature.
@nb.njit(nb.int32[::1](nb.types.Array(nb.uint8, 1, 'C', readonly=True)))
def compute(arr):
    sep = ord(';')
    base = ord('0')
    minus = ord('-')

    # First pass: count the separators so the output can be pre-allocated.
    count = 1
    for c in arr:
        count += c == sep

    # Second pass: accumulate the digits of each number and flush it at every ';'.
    res = np.empty(count, np.int32)
    val = 0
    positive = True
    cur = 0
    for c in arr:
        if c != sep and c != minus:
            val = (val * 10) + c - base  # append one decimal digit
        elif c == minus:
            positive = False
        else:
            res[cur] = val if positive else -val
            cur += 1
            val = 0
            positive = True
    if cur < count:  # flush the last number (no trailing separator)
        res[cur] = val if positive else -val
    return res

x = ';'.join(np.random.randint(0, 200, 10_000_000).astype(str))
result = compute(np.frombuffer(x.encode('ascii'), np.uint8))

Note that the Numba solution performs no checks, for the sake of performance: it assumes the input contains only digits, ';' separators and '-' signs. Thus, you must ensure the input is valid. Alternatively, you can perform additional checks in it (at the expense of slower code).
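
If validation is needed, one possibility is to check the raw bytes once with a regular expression before calling the unchecked kernel. Here is a sketch; compute_checked is an illustrative name, not part of the code above:

import re
import numpy as np

def compute_checked(data):
    # data: the ASCII-encoded bytes, e.g. x.encode('ascii').
    # Reject anything that is not signed integers separated by ';'.
    if re.fullmatch(rb'-?\d+(?:;-?\d+)*', data) is None:
        raise ValueError("input is not a ';'-separated list of integers")
    return compute(np.frombuffer(data, np.uint8))

This keeps the hot loop untouched, at the cost of one extra pass over the input.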

Here are performance results on my machine with an i5-9600KF processor (with Numpy 1.22.4 on Windows):

np.fromstring(x, dtype=np.int32, sep=';'):       8927 ms
np.array(re.split(";", x), dtype=np.int32):      1419 ms
np.array(x.split(";"), dtype=np.int32):          1196 ms
Numba implementation:                              78 ms
Numba implementation (without negative numbers):   67 ms
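
Such measurements can be reproduced with a simple harness like the following (a sketch, not the exact methodology used above; since compute was declared with an explicit signature, it is compiled eagerly, so compilation time does not pollute the timings):

import re
import timeit
import numpy as np

x = ';'.join(np.random.randint(0, 200, 10_000_000).astype(str))
buf = np.frombuffer(x.encode('ascii'), np.uint8)

for label, fn in [
    ("np.fromstring", lambda: np.fromstring(x, dtype=np.int32, sep=';')),
    ("re.split",      lambda: np.array(re.split(";", x), dtype=np.int32)),
    ("str.split",     lambda: np.array(x.split(";"), dtype=np.int32)),
    ("Numba",         lambda: compute(buf)),
]:
    t = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{label:>15}: {t * 1000:.0f} ms")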

This solution is 114 times faster than np.fromstring and 15 times faster than the fastest alternative (based on split). Note that removing the support for negative numbers makes the Numba function 18% faster. Also note that 10~12% of the time is spent in encode. The rest of the time comes from the main loop in the Numba function. More specifically, the conditionals in the loop are the main source of the slowdown, because they can hardly be predicted by the processor and they prevent the use of fast (SIMD) instructions. This is often why string parsing is slow.

A possible improvement is to use a branchless implementation operating on chunks. Another is to process the chunks with multiple threads, as sketched below. However, both optimizations are tricky to implement and make the resulting code significantly harder to read (and thus to maintain).
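
For illustration only, here is roughly what the multi-threaded variant could look like. This is a sketch under the assumption that compute is recompiled with nogil=True added to its njit decorator (so the threads can actually overlap); compute_parallel is a hypothetical name and error handling is omitted:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def compute_parallel(buf, n_threads=4):
    # Cut the byte buffer at ';' boundaries so every chunk holds whole numbers.
    n = len(buf)
    starts, ends = [0], []
    for i in range(1, n_threads):
        pos = (n * i) // n_threads
        while pos < n and buf[pos] != ord(';'):  # advance to the next separator
            pos += 1
        ends.append(pos)                 # chunk ends just before the ';'
        starts.append(min(pos + 1, n))   # next chunk starts just after it
    ends.append(n)
    chunks = [buf[s:e] for s, e in zip(starts, ends) if e > s]
    # Parse the chunks in parallel; requires compute to release the GIL.
    with ThreadPoolExecutor(n_threads) as ex:
        parts = list(ex.map(compute, chunks))
    return np.concatenate(parts)

The branchless chunk-based variant is significantly more involved and is not shown here.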

Jérôme Richard
  • "Strings are slow, especially unicode ones." — what exactly do you mean by that? I think without context this is a fairly useless characterisation and I would even call it wrong. There isn't anything *inherently* slow with strings, and Unicode strings aren't inherently slower *unless* you actually need to handle Unicode specifics. Parsing numbers can be made as fast for Unicode strings as for other string. – Konrad Rudolph Dec 21 '22 at 13:51
  • 1
    The problem with unicode strings is that the variable-size encoding makes automatic SIMD vectorization nearly impossible for compilers. Since intrinsics are not available in Numba (and are not portable anyway) and there is no SIMD interface for that, the assembly code will be a scalar one. Scalar code is slow because it is bound to >=1 cycle per character (in fact, several cycles per character are generally needed). Native unbounded C-strings have a similar issue, though it is possible to process them quite efficiently using chunks, since the memory is pretty slow too (see "*Memory Wall*"). – Jérôme Richard Dec 21 '22 at 17:51
  • 1
    I agree with you that dealing with unicode specifics is what makes unicode strings slow, and if there are only ASCII characters then it is not so slow. But still, in CPython, the string has to be copied unless one deals with low-level C extensions (or maybe Cython). This copy is a bit expensive but not critical here (same for the decoding). AFAIK, CPython does the decoding pretty efficiently thanks to a library doing the operation with manual SIMD intrinsics. – Jérôme Richard Dec 21 '22 at 17:59
  • 1
    90% of the time is spent on the computation over the character array, which is not really a string. What makes the computation not very fast in Numba is basically the use of conditionals (in a loop with dependency chains), and this is more generally due to the parsing. Conditionals are slow when they cannot be predicted, and this is often the case for string parsing. Moreover, dependency chains and conditionals prevent any automatic vectorization of the loop. In fact, this is hard to do efficiently even manually, because of the variable size of the integers. – Jérôme Richard Dec 21 '22 at 18:04
  • 1
    Unicode string parsing can indeed get close performance for UTF-8 ASCII-based strings, but not the same, due to the ASCII check. UCS-4 ASCII strings are generally significantly slower because they are 4x bigger in memory and in SIMD registers (compared to a basic byte-string); UCS-4 is what Numpy uses internally. – Jérôme Richard Dec 21 '22 at 18:11
  • 1
    Thus, maybe, I should have said that "*string parsing is slow*". Theoretically, string parsing (including unicode decoding) can often be made fast with SIMD intrinsics, but most people do not use them for strings (nor a portable library), and compilers are generally far from smart enough to do this job efficiently. In the end, this is why I said "strings are slow". This is not very precise nor very exact, but I think it is important to reduce this complex subject to a simple (simplistic?) idea. – Jérôme Richard Dec 21 '22 at 18:19