32

As mentioned in the title, I'm looking for something that can give me more performance than atoi. Presently, the fastest way I know is

atoi(mystring.c_str())

Finally, I would prefer a solution that doesn't rely on Boost. Does anybody have good performance tricks for doing this?

Additional Information: int will not exceed 2 billion, it is always positive, the string has no decimal places in it.

user788171
  • 16,753
  • 40
  • 98
  • 125
  • 11
    You're going to have a hard time beating atoi. – Joel May 30 '13 at 01:17
  • 5
    The answer to this question might depend a little on what integer range you allow. Do you want to convert *any* integer, or is your allowable input more specific? Do you know for sure that `mystring` contains *only* an integer with no other characters? Can it be negative? – paddy May 30 '13 at 01:18
  • I added some additional information, regular sized int, always positive, no decimals in the string. – user788171 May 30 '13 at 01:22
  • 7
    You're getting good answers, but I always have to wonder - do you actually know `atoi` all by itself is consuming a healthy percent of your overall time? People often ask questions like this when in fact there's something else that would yield much more speedup, but they don't know how to find such things. – Mike Dunlavey May 30 '13 at 01:55

11 Answers11

44

I experimented with solutions using lookup tables, but found them fraught with issues, and actually not very fast. The fastest solution turned out to be the least imaginitive:

int fast_atoi( const char * str )
{
    int val = 0;
    while( *str ) {
        val = val*10 + (*str++ - '0');
    }
    return val;
}

Running a benchmark with a million randomly generated strings:

fast_atoi : 0.0097 seconds
atoi      : 0.0414 seconds

To be fair, I also tested this function by forcing the compiler not to inline it. The results were still good:

fast_atoi : 0.0104 seconds
atoi      : 0.0426 seconds

Provided your data conforms to the requirements of the fast_atoi function, that is pretty reasonable performance. The requirements are:

  1. Input string contains only numeric characters, or is empty
  2. Input string represents a number from 0 up to INT_MAX
paddy
  • 60,864
  • 6
  • 61
  • 103
15

atoi can be improved upon significantly, given certain assumptions. This was demonstrated powerfully in a presentation by Andrei Alexandrescu at the C++ and Beyond 2012 conference. Hi s replacement used loop unrolling and ALU parallelism to achieve orders of magnitude in perf improvement. I don't have his materials, but this link uses a similar technique: http://tombarta.wordpress.com/2008/04/23/specializing-atoi/

Scott Jones
  • 2,880
  • 13
  • 19
  • 2
    I think I have also seen that. Is [this](https://groups.google.com/forum/?fromgroups#!topic/comp.lang.c++/1juoSwn5t20) the presentation you refer to? It's not C++ and Beyond, though. And I think it's mostly about int-to-string rather than reverse. But there is a lot to learn from that anyway. – jogojapan May 30 '13 at 01:39
  • Linked `int atoi(const char *str)` fails to detect all overflow. – chux - Reinstate Monica Mar 12 '20 at 21:51
15

This page compares conversion speed between different string->int functions using different compilers. The naive function, which offers no error checking, offers speeds roughly twice as fast as atoi(), according to the results presented.

// Taken from http://tinodidriksen.com/uploads/code/cpp/speed-string-to-int.cpp
int naive(const char *p) {
    int x = 0;
    bool neg = false;
    if (*p == '-') {
        neg = true;
        ++p;
    }
    while (*p >= '0' && *p <= '9') {
        x = (x*10) + (*p - '0');
        ++p;
    }
    if (neg) {
        x = -x;
    }
    return x;
}

it is always positive

Remove the negative checks in the above code for a micro optimization.

If you can guarantee the string will not have anything but numeric characters, you can micro optimize further by changing the loop

while (*p >= '0' && *p <= '9') {

to

while (*p != '\0' ) {

Which leaves you with

unsigned int naive(const char *p) {
    unsigned int x = 0;
    while (*p != '\0') {
        x = (x*10) + (*p - '0');
        ++p;
    }
    return x;
}
x-x
  • 7,287
  • 9
  • 51
  • 78
  • Yet the naive implementation doesn't comply to the Standard: it doesn't discard leading whitespaces. If you don't need the guarantees of the Standard.. – dyp May 30 '13 at 01:38
  • @DyP: `'\0'` will break out of the loop... it's `< '0'`. Anyway, the linked page doesn't list any functions other than this naive loop which are serious candidates to outperform atoi - they're not utilising any efficiencies coming from the insights above re an expectation of valid characters, always being positive, known maximum size, no need for any error checking.... – Tony Delroy May 30 '13 at 01:51
  • Oops you're right, brain comparison operator broken, sry.. But then you could change it to `while(*p != '\0')`.. – dyp May 30 '13 at 01:52
  • 2
    + That is basically the way I do it, although in the loop I tend to write `{x *= 10; x += (*p++ - '0');}`. It probably compiles to about the same thing. – Mike Dunlavey May 30 '13 at 02:01
  • (*p++)&15 is probably faster than *p++ - '0' – johnnycrash May 14 '14 at 15:23
  • 1
    @johnnycrash not sure why it would be. Bitwise & and constant subtraction are both one instruction. – Eric Pauley Mar 02 '17 at 20:17
  • UV for `unsigned int naive(const char *p)` – chux - Reinstate Monica Mar 12 '20 at 21:54
12

Quite a few of the code examples here are quite complex and do unnecessary work, meaning the code could be slimmer and faster.

Conversion loops are often written to do three different things with each character:

  • bail out if it is the end-of-string character
  • bail out if it is not a digit
  • convert it from its code point to the actual digit value

First observation: there is no need to check for the end-of-string character separately, since it is not a digit. Hence the check for 'digitness' covers the EOS condition implicitly.

Second observation: double conditions for range testing as in (c >= '0' && c <= '9') can be converted to a single test condition by using an unsigned type and anchoring the range at zero; that way there can be no unwanted values below the beginning of the range, all unwanted values are mapped to the range above the upper limit: (uint8_t(c - '0') <= 9)

It just so happens that c - '0' needs to be computed here anyway...

Hence the inner conversion loop can be slimmed to

uint64_t n = digit_value(*p);
unsigned d;

while ((d = digit_value(*++p)) <= 9)
{
   n = n * 10 + d;
}

The code here is called with the precondition that p be pointing at a digit, which is why the first digit is extracted without further ado (which also avoids a superfluous MUL).

That precondition is less outlandish than might appear at first, since p pointing at a digit is the reason why this code is called by the parser in the first place. In my code the whole shebang looks like this (assertions and other production-quality noise elided):

unsigned digit_value (char c)
{
   return unsigned(c - '0');
}

bool is_digit (char c)
{
   return digit_value(c) <= 9;
}

uint64_t extract_uint64 (char const **read_ptr)
{
   char const *p = *read_ptr;
   uint64_t n = digit_value(*p);
   unsigned d;

   while ((d = digit_value(*++p)) <= 9)
   {
      n = n * 10 + d;
   }

   *read_ptr = p;

   return n;
}

The first call to digit_value() is often elided by the compiler, if the code gets inlined and the calling code has already computed that value by calling is_digit().

n * 10 happens to be faster than manual shifting (e.g. n = (n << 3) + (n << 1) + d), at least on my machine with gcc 4.8.1 and VC++ 2013. My guess is that both compilers use LEA with index scaling for adding up to three values in one go and scaling one of them by 2, 4, or 8.

In any case that's exactly how it should be: we write nice clean code in separate functions and express the desired logic (n * 10, x % CHAR_BIT, whatever) and the compiler converts it to shifting, masking, LEAing and so on, inlines everything into the big bad parser loop and takes care of all the required messiness under the hood to make things fast. We don't even have to stick inline in front of everything anymore. If anything then we have to do the opposite, by using __declspec(noinline) judiciously when compilers get over-eager.

I'm using the above code in a program that reads billions of numbers from text files and pipes; it converts 115 million uints per second if the length is 9..10 digits, and 60 million/s for length 19..20 digits (gcc 4.8.1). That's more than ten times as fast as strtoull() (and just barely enough for my purposes, but I digress...). That's the timing for converting text blobs containing 10 million numbers each (100..200 MB), meaning that memory timings make these numbers appear a bit worse than they would be in a synthetic benchmark running from cache.

DarthGizka
  • 4,347
  • 1
  • 24
  • 36
7

Paddy's implementation of fast_atoi is faster than atoi - without the shadow of the doubt - however it works only for unsigned integers.

Below, I put evaluated version of Paddy's fast_atoi that also allows only unsigned integers but speeds conversion up even more by replacing costly operation * with +

unsigned int fast_atou(const char *str)
{
    unsigned int val = 0;
    while(*str) {
        val = (val << 1) + (val << 3) + *(str++) - 48;
    }
    return val;
}

Here, I put complete version of fast_atoi() that i'm using sometimes which converts singed integers as well:

int fast_atoi(const char *buff)
{
    int c = 0, sign = 0, x = 0;
    const char *p = buff;

    for(c = *(p++); (c < 48 || c > 57); c = *(p++)) {if (c == 45) {sign = 1; c = *(p++); break;}}; // eat whitespaces and check sign
    for(; c > 47 && c < 58; c = *(p++)) x = (x << 1) + (x << 3) + c - 48;

    return sign ? -x : x;
} 
soerium
  • 573
  • 4
  • 12
  • 1
    Not sure if the bit shift solution is actually faster, since x86 truncated multiplication is one instruction and gcc will know that the top bits don't matter. – Eric Pauley Mar 02 '17 at 20:24
  • Indeed, I ran benchmarks with * 10 and with the bitshift algorithm on MSVC, and I'm consistently getting better speeds with * 10. – Octo Poulos Jul 22 '22 at 05:38
  • Any modern decent compiler knows to automatically turn multiplication by some constant into shifts or whatever else if it is most efficient for the target CPU, if you turn up optimizations (and, if needed, give it enough information about exact CPU models you are optimizing for). In general, if you can know just by looking at the expression that you can optimize it to a different expression, a good modern compiler almost certainly knows too. (Still a +1 on this answer because this is good information to know.) – mtraceur Jan 15 '23 at 22:53
2

Here's the entirety of the atoi function in gcc:

long atoi(const char *str)
{
    long num = 0;
    int neg = 0;
    while (isspace(*str)) str++;
    if (*str == '-')
    {
        neg=1;
        str++;
    }
    while (isdigit(*str))
    {
        num = 10*num + (*str - '0');
        str++;
    }
    if (neg)
        num = -num;
    return num;
 }

The whitespace and negative check are superfluous in your case, but also only use nanoseconds.

isdigit is almost certainly inlined, so that's not costing you any time.

I really don't see room for improvement here.

Joel
  • 5,618
  • 1
  • 20
  • 19
  • I was able to use this to create a function template for different integer types and run it on an AVR. – Caleb Reister Jan 22 '18 at 20:05
  • "I really don't see room for improvement here." `10*num + (*str - '0')` is UB when final result s/b `LONG_MIN`. `isspace(*str)`, `isdigit(*str)` UB when `*str < 0` - not likely of concern for OP though. – chux - Reinstate Monica Mar 12 '20 at 22:08
2

A faster convert function only for positive integers without error checking.

Multiplication is always slower that sum and shift, therefore change multiply with shift.

int fast_atoi( const char * str )
{
    int val = 0;
    while( *str ) {
        val = (val << 3) + (val << 1) + (*str++ - '0');
    }
    return val;
}
Andre Kampling
  • 5,476
  • 2
  • 20
  • 47
hamSh
  • 1,163
  • 1
  • 13
  • 27
1

I did a quick benchmark of the different functions given here + some extras, and I converted them to int64_t by default. Compiler = MSVC.

Here are the results (left = normal time, right = time with overhead deduction):

atoi            : 153283912 ns => 1.000x : 106745800 ns => 1.000x
atoll           : 174446125 ns => 0.879x : 127908013 ns => 0.835x
std::stoll      : 358193237 ns => 0.428x : 311655125 ns => 0.343x
std::stoull     : 354171912 ns => 0.433x : 307633800 ns => 0.347x
-----------------------------------------------------------------
fast_null       :  46538112 ns => 3.294x :         0 ns => infx   (overhead estimation)
fast_atou       :  92299625 ns => 1.661x :  45761513 ns => 2.333x (@soerium)
FastAtoiBitShift:  93275637 ns => 1.643x :  46737525 ns => 2.284x (@hamSh)
FastAtoiMul10   :  93260987 ns => 1.644x :  46722875 ns => 2.285x (@hamSh but with *10)
FastAtoiCompare :  86691962 ns => 1.768x :  40153850 ns => 2.658x (@DarthGizka)
FastAtoiCompareu:  86960900 ns => 1.763x :  40422788 ns => 2.641x (@DarthGizka + uint)
-----------------------------------------------------------------
FastAtoi32      :  92779375 ns => 1.652x :  46241263 ns => 2.308x (handle the - sign)
FastAtoi32u     :  86577312 ns => 1.770x :  40039200 ns => 2.666x (no sign)
FastAtoi32uu    :  87298600 ns => 1.756x :  40760488 ns => 2.619x (no sign + uint)
FastAtoi64      :  93693575 ns => 1.636x :  47155463 ns => 2.264x
FastAtoi64u     :  86846912 ns => 1.765x :  40308800 ns => 2.648x
FastAtoi64uu    :  86890537 ns => 1.764x :  40352425 ns => 2.645x
FastAtoiDouble  :  90126762 ns => 1.701x :  43588650 ns => 2.449x (only handle int)
FastAtoiFloat   :  92062775 ns => 1.665x :  45524663 ns => 2.345x (same)

DarthGizka's code is the fastest and has the advantage of stopping when the char is non-digit.

Also, the bitshifting "optimization" is a tiny bit slower than just doing * 10.

The benchmark runs each algorithm with 10 million iterations on a pseudo-random string, to limit the branch prediction as much as possible, and then it re-runs everything 15 more times. For each algorithm, the 4 slowest and 4 fastest times are discarded, and the result given is the average of the 8 median times. This provides a lot of stability. Also, I run fast_null in order to estimate the overhead in the benchmark (loop + string changes + function call), and then this value is deducted in the second numbers.

Here is the code for the functions:

int64_t fast_null(const char* str) { return (str[0] - '0') + (str[1] - '0'); }

int64_t fast_atou(const char* str)
{
    int64_t val = 0;
    while (*str) val = (val << 1) + (val << 3) + *(str++) - 48;
    return val;
}

int64_t FastAtoiBitShift(const char* str)
{
    int64_t val = 0;
    while (*str) val = (val << 3) + (val << 1) + (*str++ - '0');
    return val;
}

int64_t FastAtoiMul10(const char* str)
{
    int64_t val = 0;
    while (*str) val = val * 10 + (*str++ - '0');
    return val;
}

int64_t FastAtoiCompare(const char* str)
{
    int64_t val = 0;
    uint8_t x;
    while ((x = uint8_t(*str++ - '0')) <= 9) val = val * 10 + x;
    return val;
}

uint64_t FastAtoiCompareu(const char* str)
{
    uint64_t val = 0;
    uint8_t  x;
    while ((x = uint8_t(*str++ - '0')) <= 9) val = val * 10 + x;
    return val;
}

int32_t FastAtoi32(const char* str)
{
    int32_t val  = 0;
    int     sign = 0;
    if (*str == '-')
    {
        sign = 1;
        ++str;
    }
    uint8_t digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit;
    return sign ? -val : val;
}

int32_t FastAtoi32u(const char* str)
{
    int32_t val = 0;
    uint8_t digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit;
    return val;
}

uint32_t FastAtoi32uu(const char* str)
{
    uint32_t val = 0;
    uint8_t  digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10u + digit;
    return val;
}

int64_t FastAtoi64(const char* str)
{
    int64_t val  = 0;
    int     sign = 0;
    if (*str == '-')
    {
        sign = 1;
        ++str;
    }
    uint8_t digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit;
    return sign ? -val : val;
}

int64_t FastAtoi64u(const char* str)
{
    int64_t val = 0;
    uint8_t digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit;
    return val;
}

uint64_t FastAtoi64uu(const char* str)
{
    uint64_t val = 0;
    uint8_t  digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10u + digit;
    return val;
}

float FastAtoiFloat(const char* str)
{
    float   val = 0;
    uint8_t x;
    while ((x = uint8_t(*str++ - '0')) <= 9) val = val * 10.0f + x;
    return val;
}

double FastAtoiDouble(const char* str)
{
    double  val = 0;
    uint8_t x;
    while ((x = uint8_t(*str++ - '0')) <= 9) val = val * 10.0 + x;
    return val;
}

And the benchmark code I used, just in case...

void Benchmark()
{
    std::map<std::string, std::vector<int64_t>> funcTimes;
    std::map<std::string, std::vector<int64_t>> funcTotals;
    std::map<std::string, int64_t>              funcFinals;

#define BENCH_ATOI(func)                     \
    do                                       \
    {                                        \
        auto    start    = NowNs();          \
        int64_t z        = 0;                \
        char    string[] = "000001987";      \
        for (int i = 1e7; i >= 0; --i)       \
        {                                    \
            string[0] = '0' + (i + 0) % 10;  \
            string[1] = '0' + (i + 1) % 10;  \
            string[2] = '0' + (i + 3) % 10;  \
            string[3] = '0' + (i + 5) % 10;  \
            string[4] = '0' + (i + 9) % 10;  \
            z += func(string);               \
        }                                    \
        auto elapsed = NowNs() - start;      \
        funcTimes[#func].push_back(elapsed); \
        funcTotals[#func].push_back(z);      \
    }                                        \
    while (0)

    for (int i = 0; i < 16; ++i)
    {
        BENCH_ATOI(atoi);
        BENCH_ATOI(atoll);
        BENCH_ATOI(std::stoll);
        BENCH_ATOI(std::stoull);
        //
        BENCH_ATOI(fast_null);
        BENCH_ATOI(fast_atou);
        BENCH_ATOI(FastAtoiBitShift);
        BENCH_ATOI(FastAtoiMul10);
        BENCH_ATOI(FastAtoiCompare);
        BENCH_ATOI(FastAtoiCompareu);
        //
        BENCH_ATOI(FastAtoi32);
        BENCH_ATOI(FastAtoi32u);
        BENCH_ATOI(FastAtoi32uu);
        BENCH_ATOI(FastAtoi64);
        BENCH_ATOI(FastAtoi64u);
        BENCH_ATOI(FastAtoi64uu);
        BENCH_ATOI(FastAtoiFloat);
        BENCH_ATOI(FastAtoiDouble);
    }

    for (auto& [func, times] : funcTimes)
    {
        std::sort(times.begin(), times.end(), [](const auto& a, const auto& b) { return a < b; });
        fmt::print("{:<16}: {}\n", func, funcTotals[func][0]);
        int64_t total = 0;
        for (int i = 4; i <= 11; ++i) total += times[i];
        total /= 8;
        funcFinals[func] = total;
    }

    const auto base     = funcFinals["atoi"];
    const auto overhead = funcFinals["fast_null"];
    for (const auto& [func, final] : funcFinals)
        fmt::print("{:<16}: {:>9} ns => {:.3f}x : {:>9} ns => {:.3f}x\n", func, final, base * 1.0 / final, final - overhead, (base - overhead) * 1.0 / (final - overhead));
}
Octo Poulos
  • 523
  • 6
  • 5
0

Why not use a stringstream? I'm not sure of its particular overhead, but you could define:

int myInt; 
string myString = "1561";
stringstream ss;
ss(myString);
ss >> myInt;

Of course, you'd need to

#include <stringstream> 
Rome_Leader
  • 2,518
  • 9
  • 42
  • 73
  • 4
    That's the canonical C++ way but it is several orders of magnitude slower than a slimmed 'naive' conversion loop. – DarthGizka Nov 14 '14 at 22:01
0

The only definitive answer is with checking with your compiler, your real data.

Something I'd try (even if it's using memory accesses so it may be slow depending on caching) is

int value = t1[s[n-1]];
if (n > 1) value += t10[s[n-2]]; else return value;
if (n > 2) value += t100[s[n-3]]; else return value;
if (n > 3) value += t1000[s[n-4]]; else return value;
... continuing for how many digits you need to handle ...

if t1, t10 and so on are statically allocated and constant the compiler shouldn't fear any aliasing and the machine code generated should be quite decent.

6502
  • 112,025
  • 15
  • 165
  • 265
-1

Here is mine. Atoi is the fastest I could come up with. I compiled with msvc 2010 so it might be possible to combine both templates. In msvc 2010, when I combined templates it made the case where you provide a cb argument slower.

Atoi handles nearly all the special atoi cases, and is as fast or faster than this:

int val = 0;
while( *str ) 
    val = val*10 + (*str++ - '0');

Here is the code:

#define EQ1(a,a1) (BYTE(a) == BYTE(a1))
#define EQ1(a,a1,a2) (BYTE(a) == BYTE(a1) && EQ1(a,a2))
#define EQ1(a,a1,a2,a3) (BYTE(a) == BYTE(a1) && EQ1(a,a2,a3))

// Atoi is 4x faster than atoi.  There is also an overload that takes a cb argument.
template <typename T> 
T Atoi(LPCSTR sz) {
    T n = 0;
    bool fNeg = false;  // for unsigned T, this is removed by optimizer
    const BYTE* p = (const BYTE*)sz;
    BYTE ch;
    // test for most exceptions in the leading chars.  Most of the time
    // this test is skipped.  Note we skip over leading zeros to avoid the 
    // useless math in the second loop.  We expect leading 0 to be the most 
    // likely case, so we test it first, however the cpu might reorder that.
    for ( ; (ch=*p-'1') >= 9 ; ++p) { // unsigned trick for range compare
      // ignore leading 0's, spaces, and '+'
      if (EQ1(ch, '0'-'1', ' '-'1', '+'-'1'))
        continue;
      // for unsigned T this is removed by optimizer
      if (!((T)-1 > 0) && ch==BYTE('-'-'1')) {
        fNeg = !fNeg;
        continue;
      }
      // atoi ignores these.  Remove this code for a small perf increase.
      if (BYTE(*p-9) > 4)  // \t, \n, 11, 12, \r. unsigned trick for range compare
        break;
    }
    // deal with rest of digits, stop loop on non digit.
    for ( ; (ch=*p-'0') <= 9 ; ++p) // unsigned trick for range compare
      n = n*10 + ch; 
    // for unsigned T, (fNeg) test is removed by optimizer
    return (fNeg) ? -n : n;
}

// you could go with a single template that took a cb argument, but I could not
// get the optimizer to create good code when both the cb and !cb case were combined.
// above code contains the comments.
template <typename T>
T Atoi(LPCSTR sz, BYTE cb) {
    T n = 0;
    bool fNeg = false; 
    const BYTE* p = (const BYTE*)sz;
    const BYTE* p1 = p + cb;
    BYTE ch;
    for ( ; p<p1 && (ch=*p-'1') >= 9 ; ++p) {
      if (EQ1(ch,BYTE('0'-'1'),BYTE(' '-'1'),BYTE('+'-'1')))
        continue;
      if (!((T)-1 > 0) && ch == BYTE('-'-'1')) {
        fNeg = !fNeg;
        continue;
      }
      if (BYTE(*p-9) > 4)  // \t, \n, 11, 12, \r
        break;
    }
    for ( ; p<p1 && (ch=*p-'0') <= 9 ; ++p)
      n = n*10 + ch; 
    return (fNeg) ? -n : n;
}
johnnycrash
  • 5,184
  • 5
  • 34
  • 58