0

What's the fastest way to convert a string represented by (const char*, size_t) to an int?

The string is not null-terminated. Both these ways involve a string copy (and more) which I'd like to avoid.

And yes, this function is called a few million times a second. :p

int to_int0(const char* c, size_t sz)
{
    return atoi(std::string(c, sz).c_str());
}

int to_int1(const char* c, size_t sz)
{
    return boost::lexical_cast<int>(std::string(c, sz));
}
XTF
  • 1,091
  • 1
  • 13
  • 31
  • if you're using C++11 there would be no copies, because the move constructor would be invoked on the temporary string passed to `atoi` – Tony The Lion Mar 08 '12 at 15:48
  • 1
    @TonyTheLion: You still need to copy the data once to create a `string` object (or a zero-terminated string). – Jerry Coffin Mar 08 '12 at 15:52

6 Answers6

3

Given a counted string like this, you may be able to gain a little speed by doing the conversion yourself. Depending on how robust the code needs to be, this may be fairly difficult though. For the moment, let's assume the easiest case -- that we're sure the string is valid, containing only digits, (no negative numbers for now) and the number it represents is always within the range of an int. For that case:

int to_int2(char const *c, size_t sz) { 
    int retval = 0;
    for (size_t i=0; i<sz; i++)
        retval *= 10;
        retval += c[i] -'0';
    }
    return retval;
}

From there, you can get about as complex as you want -- handling leading/trailing whitespace, '-' (but doing so correctly for the maximally negative number in 2's complement isn't always trivial [edit: see Nawaz's answer for one solution to this]), digit grouping, etc.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
3

Another slow version, for uint32:

void str2uint_aux(unsigned& number, unsigned& overflowCtrl, const char*& ch)
{
    unsigned digit = *ch - '0';
    ++ch;

    number = number * 10 + digit;

    unsigned overflow = (digit + (256 - 10)) >> 8;
    // if digit < 10 then overflow == 0
    overflowCtrl += overflow;
}

unsigned str2uint(const char* s, size_t n)
{
    unsigned number = 0;
    unsigned overflowCtrl = 0;

    // for VC++10 the Duff's device is faster than loop
    switch (n)
    {
    default:
        throw std::invalid_argument(__FUNCTION__ " : `n' too big");

    case 10: str2uint_aux(number, overflowCtrl, s);
    case  9: str2uint_aux(number, overflowCtrl, s);
    case  8: str2uint_aux(number, overflowCtrl, s);
    case  7: str2uint_aux(number, overflowCtrl, s);
    case  6: str2uint_aux(number, overflowCtrl, s);
    case  5: str2uint_aux(number, overflowCtrl, s);
    case  4: str2uint_aux(number, overflowCtrl, s);
    case  3: str2uint_aux(number, overflowCtrl, s);
    case  2: str2uint_aux(number, overflowCtrl, s);
    case  1: str2uint_aux(number, overflowCtrl, s);
    }

    // here we can check that all chars were digits
    if (overflowCtrl != 0)
        throw std::invalid_argument(__FUNCTION__ " : `s' is not a number");

    return number;
}

Why it's slow? Because it processes chars one-by-one. If we'd had a guarantee that we can access bytes upto s+16, we'd can use vectorization for *ch - '0' and digit + 246.
Like in this code:

    uint32_t digitsPack = *(uint32_t*)s - '0000';
    overflowCtrl |= digitsPack | (digitsPack + 0x06060606); // if one byte is not in range [0;10), high nibble will be non-zero
    number = number * 10 + (digitsPack >> 24) & 0xFF;
    number = number * 10 + (digitsPack >> 16) & 0xFF;
    number = number * 10 + (digitsPack >> 8) & 0xFF;
    number = number * 10 + digitsPack & 0xFF;
    s += 4;

Small update for range checking:
the first snippet has redundant shift (or mov) on every iteration, so it should be

unsigned digit = *s - '0';
overflowCtrl |= (digit + 256 - 10);
...
if (overflowCtrl >> 8 != 0) throw ...
Abyx
  • 12,345
  • 5
  • 44
  • 76
  • Even this may have per-iteration branching (calling `str2uint_aux` if it isn't inlined). It also imposes an arbitrary limitation on the size of number that can be processed (e.g., it's not adequate for a 64-bit type). That could be expanded, but only at the expense of roughly doubling an already large routine. Dealing with signed numbers would increase that yet more. That *may* be worthwhile, but may almost as easily use enough extra cache to be slower overall. – Jerry Coffin Mar 08 '12 at 20:27
  • 1
    @JerryCoffin: the main point of this code is that there is no need to use `s[i] >= '0' && s[i] <= '9'`. Complete or partial loop unrolling, C macro vs inline functions - are minor aspects of this code. – Abyx Mar 08 '12 at 21:20
  • I understand what you were after, and yes, coalescing the overflows together instead of treating them individually is a decent idea -- but unless you expect quite a bit of bad data, branch prediction will work well here, so your extra work is likely to gain little (if anything). – Jerry Coffin Mar 08 '12 at 21:26
  • @Abyx: I don't see why there is no need to use `s[i] >= '0' && s[i] <= '9'`? What if there is some non-digit character? Explain that. – Nawaz Mar 09 '12 at 15:03
  • 1
    @Nawaz: for non-digit characters, the `overflowCtrl` variable will be non-zero. There is no need to use *comparisons* and branching - just substract `'0'`, add `246` and if it were not-a-digit, result will be greater than 255. So, you can just collect that overflows and check them at once. – Abyx Mar 09 '12 at 16:54
  • @Abyx: How exactly? What if the non-digit character has smaller value than `'0'`? Take for example, `s= "786*76"`, here is one non-digit character which is `'*'`. So according to you : `('*' - '0' + 246) = (42 - 48 + 246) = 240`. So `240` is greater than `255`? – Nawaz Mar 09 '12 at 17:03
  • 1
    @Nawaz: `digit` is unsigned, so `'*' - '0'` is `0xf...fffffffa`, it's greater than `0xff`. – Abyx Mar 09 '12 at 17:07
2

Fastest:

int to_int(char const *s, size_t count)
{
     int result = 0;
     size_t i = 0 ;
     if ( s[0] == '+' || s[0] == '-' ) 
          ++i;
     while(i < count)
     {
          if ( s[i] >= '0' && s[i] <= '9' )
          {
              //see Jerry's comments for explanation why I do this
              int value = (s[0] == '-') ? ('0' - s[i] ) : (s[i]-'0');
              result = result * 10 + value;
          }
          else
              throw std::invalid_argument("invalid input string");
          i++;
     }
     return result;
} 

Since in the above code, the comparison (s[0] == '-') is done in every iteration, we can avoid this by calculating result as negative number in the loop, and then return result if s[0] is indeed '-', otherwise return -result (which makes it a positive number, as it should be):

int to_int(char const *s, size_t count)
{
     size_t i = 0 ;
     if ( s[0] == '+' || s[0] == '-' ) 
          ++i;
     int result = 0;
     while(i < count)
     {
          if ( s[i] >= '0' && s[i] <= '9' )
          {
              result = result * 10  - (s[i] - '0');  //assume negative number
          }
          else
              throw std::invalid_argument("invalid input string");
          i++;
     }
     return s[0] == '-' ? result : -result; //-result is positive!
} 

That is an improvement!


In C++11, you could however use any function from std::stoi family. There is also std::to_string family.

Nawaz
  • 353,942
  • 115
  • 666
  • 851
  • 1
    Note that this will fail for the maximally negative number on a two's complement system... – Jerry Coffin Mar 08 '12 at 15:57
  • 3
    @Abyx: for what it does, it's about as fast as you can get. What I posted is probably a little faster, but only at the expense of being less robust (i.e., makes no attempt at dealing with negative numbers or with a string containing anything except digits). – Jerry Coffin Mar 08 '12 at 15:58
  • @JerryCoffin: Could you please elaborate on that a bit more? – Nawaz Mar 08 '12 at 15:59
  • 2
    Sure: Let's assume 16-bit integers for the moment, and a string containing `"-32768"`. This tries to convert `32768` to an int, but that will overflow because the largest 16-bit signed int is `32767`. – Jerry Coffin Mar 08 '12 at 16:01
  • @JerryCoffin: Ohh.. I see. But that is possible even for bigger *positive* numbers as well. – Nawaz Mar 08 '12 at 16:02
  • 2
    Yes, if the number in the string represents a number too large to fit in an int, overflow is basically inevitable. In this case, however, the string represents a number that *can* fit in an int, but the function still won't produce the correct result (at least not dependably). – Jerry Coffin Mar 08 '12 at 16:05
  • @JerryCoffin: Ohh.. *now* I understand why you said about *negative* number. Thanks. :) – Nawaz Mar 08 '12 at 16:08
  • 1
    Yup, I think that's got it. Hopefully @Abyx (or whoever downvoted) will remove it now. One more minor point: for maximum speed, it's *probably* better to have two separate loops, one for positive and the other for negative numbers, so you only do that test once instead of every iteration. – Jerry Coffin Mar 08 '12 at 16:19
  • @JerryCoffin: Yes. I was thinking of doing that, then thought it would be too verbose. But I hope OP sees your comment, and does it himself. – Nawaz Mar 08 '12 at 16:27
  • It's **slow** because there is a lot of conditional branchings. Also compiler may not optimize access-by-index. – Abyx Mar 08 '12 at 17:43
  • 1
    @Abyx: If you can see a way to eliminate the branching, please do tell. Compilers that couldn't optimize array indexing to be as efficient as pointer notation have been obsolete for decades. – Jerry Coffin Mar 08 '12 at 17:57
  • Thanks. I was hoping a library function for this was available in C++ or Boost, it's kinda disappointing there isn't. – XTF Mar 08 '12 at 18:05
  • @JerryCoffin: see my answer below. There is no branching in every iteration. – Abyx Mar 08 '12 at 19:23
  • @JerryCoffin: I improved the code, without using *two separate loops* . I think my solution is better than *two separate loops* idea. – Nawaz Jun 12 '12 at 11:23
  • @JerryCoffin: Saying *Nice* is better than an upvote. Thanks :D – Nawaz Jun 12 '12 at 15:55
1
llvm::StringRef s(c,sz);
int n;
s.getAsInteger(10,n);
return n;

http://llvm.org/docs/doxygen/html/classllvm_1_1StringRef.html

Benjamin Lindley
  • 101,917
  • 9
  • 204
  • 274
  • 1
    Since it needs to be *really* fast, it's better not use that since there are calls and branches for radix autochecks in referenced child function `GetAsUnsignedInteger()`. – Jonas Byström Mar 08 '12 at 16:04
0

If you run the function that often, I bet you parse the same number many times. My suggestion is to BCD encode the string into a static char buffer (you know it's not going to be very long, since atoi only can handle +-2G) when there's less than X digits (X=8 for 32 bit lookup, X=16 for 64 bit lookup) then place a cache in a hash map.

When you're done with the first version, you can probably find nice optimizations, such as skipping the BCD encoding entirely and just using X characters in the string (when length of string <= X) for lookup in the hash table. If the string is longer, you fallback to atoi.

Edit: ... or fallback instead of atoi to Jerry Coffin's solution, which is as fast as they come.

Jonas Byström
  • 25,316
  • 23
  • 100
  • 147
0

You'll have to either write custom routine or use 3rd party library if you're dead set on avoiding string copy.

You probably don't want to write atoi from scratch (it is still possible to make a bug here), so I'd advise to grab existing atoi from public domain or BSD-licensed code and modify it. For example, you can get existing atoi from FreeBSD cvs tree.

SigTerm
  • 26,089
  • 6
  • 66
  • 115