Quite a few of the code examples here are more complex than they need to be and do unnecessary work; the code could be both slimmer and faster.
Conversion loops are often written to do three different things with each character (roughly as in the sketch after this list):
- bail out if it is the end-of-string character
- bail out if it is not a digit
- convert it from its code point to the actual digit value
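A typical loop of that shape might look roughly like this (a hypothetical sketch, not quoted from any particular answer; the function name is made up):

#include <cstdint>

uint64_t parse_uint (char const *p)
{
    uint64_t n = 0;
    while (*p != '\0')                       // 1) bail out at end of string
    {
        if (*p < '0' || *p > '9')            // 2) bail out if not a digit
            break;
        n = n * 10 + unsigned(*p - '0');     // 3) convert code point to digit value
        ++p;
    }
    return n;
}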
First observation: there is no need to check for the end-of-string character separately, since it is not a digit. Hence the check for 'digitness' covers the EOS condition implicitly.
Second observation: a double condition for range testing, as in (c >= '0' && c <= '9'), can be converted into a single test by using an unsigned type and anchoring the range at zero. That way there can be no unwanted values below the beginning of the range; all unwanted values get mapped into the range above the upper limit: (uint8_t(c - '0') <= 9). It just so happens that c - '0' needs to be computed here anyway...
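To make the wraparound explicit: converting the difference to an unsigned type sends everything below '0' to a huge value, so a single comparison rejects both sides of the range. A tiny sketch, compile-time checks only (assuming an ASCII-compatible character set):

#include <cstdint>

static_assert(uint8_t('0' - '0') == 0, "low end of the range maps to 0");
static_assert(uint8_t('9' - '0') == 9, "high end of the range maps to 9");
static_assert(uint8_t('\0' - '0') > 9, "EOS wraps around to a value above 9");
static_assert(uint8_t(' '  - '0') > 9, "everything else lands above 9 as well");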
Hence the inner conversion loop can be slimmed down to this:

uint64_t n = digit_value(*p);
unsigned d;

while ((d = digit_value(*++p)) <= 9)
{
    n = n * 10 + d;
}
The code here is called with the precondition that p be pointing at a digit, which is why the first digit is extracted without further ado (which also avoids a superfluous MUL). That precondition is less outlandish than it might appear at first, since p pointing at a digit is the reason why this code is called by the parser in the first place. In my code the whole shebang looks like this (assertions and other production-quality noise elided):
#include <cstdint>

// '0'..'9' map to 0..9; anything else, including '\0', ends up above 9
// because the result is unsigned.
unsigned digit_value (char c)
{
    return unsigned(c - '0');
}

bool is_digit (char c)
{
    return digit_value(c) <= 9;
}

// Precondition: **read_ptr is a digit.
uint64_t extract_uint64 (char const **read_ptr)
{
    char const *p = *read_ptr;
    uint64_t n = digit_value(*p);            // first digit needs no test (see precondition)
    unsigned d;

    while ((d = digit_value(*++p)) <= 9)     // stops at the first non-digit, including '\0'
    {
        n = n * 10 + d;
    }

    *read_ptr = p;                           // report how far the conversion got
    return n;
}
The first call to digit_value() is often elided by the compiler if the code gets inlined and the calling code has already computed that value by calling is_digit().
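For illustration, a calling pattern along these lines would give the compiler that opportunity (a hypothetical caller, not part of the original parser; parse_number_token is a made-up name, and the functions above are assumed to be in scope):

#include <cstdint>

void parse_number_token (char const **read_ptr)
{
    // The parser only gets here because the current character is a digit,
    // which establishes the precondition of extract_uint64().
    if (is_digit(**read_ptr))
    {
        uint64_t value = extract_uint64(read_ptr);   // after inlining, digit_value(**read_ptr) can be reused
        (void) value;                                // ... hand the value to whatever consumes numbers ...
    }
}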
n * 10 happens to be faster than manual shifting (e.g. n = (n << 3) + (n << 1) + d), at least on my machine with gcc 4.8.1 and VC++ 2013. My guess is that both compilers use LEA with index scaling, adding up to three values in one go and scaling one of them by 2, 4, or 8.
In any case that's exactly how it should be: we write nice clean code in separate functions and express the desired logic (n * 10, x % CHAR_BIT, whatever), and the compiler converts it to shifting, masking, LEAing and so on, inlines everything into the big bad parser loop, and takes care of all the required messiness under the hood to make things fast. We don't even have to stick inline in front of everything anymore. If anything, we have to do the opposite, using __declspec(noinline) judiciously when compilers get over-eager.
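For what it's worth, the spelling is compiler-specific; a small wrapper along these lines is one way to apply it (a sketch, not taken from the original code; the macro name is made up):

#if defined(_MSC_VER)
    #define NOINLINE __declspec(noinline)        // VC++
#elif defined(__GNUC__)
    #define NOINLINE __attribute__((noinline))   // gcc/clang
#else
    #define NOINLINE
#endif

// Example: keep a cold error path from being inlined into the hot parsing loop.
NOINLINE void report_parse_error (char const *position);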
I'm using the above code in a program that reads billions of numbers from text files and pipes; it converts 115 million uints per second if the length is 9..10 digits, and 60 million/s for lengths of 19..20 digits (gcc 4.8.1). That's more than ten times as fast as strtoull() (and just barely enough for my purposes, but I digress...). These timings are for converting text blobs containing 10 million numbers each (100..200 MB), so memory timings make them look a bit worse than they would be in a synthetic benchmark running from cache.