Optimal Base-10 only itoa() function?

Question

In 20+ years programming in C I've used a base other than 10 once, so when I found my trusty MSVC's _itoa() missing in another environment, I set out to write one that only does base 10, and puts the destination buffer argument, pointing to the storage returned by the function, on the left, instead of on the right, like all of the string functions in the C Standard Library. I believe this code is also thread-safe.

Is there a faster way to do this?

I was also going to ask about correctness, but I believe the included test code proves it works, even for the particular case of LONG_MIN, which is (-1 * LONG_MAX) -1, which caused a failure in the code until I changed tactics, noted the sign, and then copied the signed int to an unsigned int. I then did all of the core work in the function in unsigned ints - which happily ran in 75% of the time as well.

char * _i32toa(char *const rtn, int32_t i)    {
    if (NULL == rtn) return NULL;

    // declare local buffer, and write to it back-to-front
    char buff[12];
    uint32_t  ut, ui;
    char minus_sign=0;
    char *p = buff + sizeof(buff)-1;
    *p-- = 0;    // nul-terminate buffer

    // deal with negative numbers while using an unsigned integer
    if (i < 0)    {
        minus_sign = '-';
        ui = (uint32_t)((int)-1 * (int)i);
    }    else    {
        ui = i;
    }

    // core code here...
    while (ui > 9) {
        ut = ui;
        ui /= 10;
        *p-- = (ut - (ui * 10)) + '0';
    }
    *p = ui + '0';

    if ('-' == minus_sign) *--p = minus_sign;

    // knowing how much storage we needed, copy chars from buff to rtn...
    memcpy(rtn, p, sizeof(buff)-(p - buff));

    return rtn;
}

// ------------------------------------------------------------------------------------------------
#define LOOP_KNT (SHRT_MAX * 1024)
// ------------------------------------------------------------------------------------------------
int main(void)    {
    time_t start = clock();

    int32_t t = 123456, i;
    char *buff = (char *)malloc(256);

    for (i = (SHRT_MIN *1024); i < LOOP_KNT; i++)    {
        _i32toa(buff, i);
    }
    printf("\nElapsed time was %f milliseconds", (double)clock() - (double)(start));

    start = clock();
    for (i = (SHRT_MIN * 1024); i < LOOP_KNT; i++)    {
        _itoa(i, buff, 10);
    }
    printf("\nElapsed time was %f milliseconds", (double)clock() - (double)(start));

    start = clock();
    for (i = (SHRT_MIN * 1024); i < LOOP_KNT; i++)    {
        ___itoa(i, buff, 10);
    }
    printf("\nElapsed time was %f milliseconds", (double)clock() - (double)(start));

    printf("\nString from integer %i is %s\n", t, _i32toa(buff, t));
    printf("\nString from integer %i is %s\n", -0, _i32toa(buff, -0));
    printf("\nString from integer %i is %s\n", -1, _i32toa(buff, -1));
    printf("\nString from integer %i is %s\n", LONG_MIN, _i32toa(buff, LONG_MIN));

    start = clock();
    for (int i = LONG_MIN; i < LONG_MAX; i++) {
        if (i != atoi(_i32toa(buff, (int32_t)i))) {
            printf("\nError for %i", i);
        }
        if (!i) printf("\nAt zero");
    }
    printf("\nElapsed time was %f milliseconds", (double)clock() - (double)(start));

    getchar();
    return 0;
}

Performance is 2-4X that of the not-part-of-the-C-standard _itoa() in Visual Studio 2013, and 10-15X that of sprintf().

The approach is somewhat novel, and depends on knowing the required buffer size for the completed string - a problem the function allocating it's own string buffer, buff[] solves, making it thread-safe at the same time.

Knowing where the end of the buffer is allows the characters of the string to be written from the back to the front, solving the reverse order problem. The calling function doesn't need to prepare *rtn in any way, as the working string that gets memcpy()ed to *ptr is already null-terminated.

TVMIA for your feedback. The lack of a good _atoi() function is a persistent enough problem it deserves a good solution. Let's make one.

PS: On my i7 Hazwell box running MSVS C++ 64-bit with full optimizations, the full loop from LONG_MIN to LONG_MAX averages 116 clocks per conversion, for the round-trip, and only 28 clocks for _itoa(). That's over 725 megabytes per second of string - if comparing to Ben Voigt's code. I think I won Ben!

Even faster ideas here on the C++ version of your question: http://stackoverflow.com/questions/4351371/c-performance-challenge-integer-to-stdstring-conversion — Ben Voigt, Feb 01 '14 at 18:26
A better place for this Q would be http://codereview.stackexchange.com/ — Brian Roach, Feb 01 '14 at 18:29
@Brian: codereview.se is not really about performance improvements. — Ben Voigt, Feb 01 '14 at 18:31
@BenVoigt erm, then why does the about page specifically list "performance" in the little graphic about what questions to ask? (Not trying to be snarky - honest Q, I don't participate there) — Brian Roach, Feb 01 '14 at 18:33
@Brian: As far as I can tell, codereview.se is all about getting the best implementation of a particular algorithm, not looking for alternate algorithms. So implementation problems affecting performance are covered. But a question like this is not. — Ben Voigt, Feb 01 '14 at 18:36
@KarolyHorvath: Of course it is, every invocation of the function has its own automatic `buff` variable. — Ben Voigt, Feb 01 '14 at 18:36
I would make them pass you the length of the buffer so you can check you are not overflowing the buffer. Also I would just return a pointer into the buffer instead of doing the memcpy(). — brian beuning, Feb 01 '14 at 18:47
@brianbeuning buff doesn't survive the function, so that's a NO-GO. I considered the length check, but that's really a decision the calling function should make, as we wouldn't want to saddle _i32toa() with that chore where it isn't needed. Also, if they've screwed up allocating the appropriate storage, why believe they will get the length right? I did consider it at length though, and having the source, you could certainly add it where desired. — , Feb 01 '14 at 18:54
Use '0' instead of 48, as it reads better and gives the reader more information about what you are doing. — Thomas Matthews, Feb 01 '14 at 19:04
@BenVoigt, assuming an average strlen(rtn) of 5, or a bit more with the negative signs on the first LONG_MIN, given an average speed of ~ 30 million per second, it's pumping out over 150 MBs, so it's competitive with those in the challenge you cited above. — , Feb 01 '14 at 19:23
@RocketRoy: Your test machine is definitely newer and probably also faster than the ones used for benchmarking in my question... and therefore a competitive answer shouldn't merely tie, it should win handily. (Furthermore, the answers in my question were required to also create a `std::string` -- the raw formatting code is substantially faster) I'm sure you can learn from several of the answers there. Or did you come here just to hear "yes, that's the fastest way of doing it"? Well, it isn't fastest, although it is fast. — Ben Voigt, Feb 01 '14 at 19:57
possible duplicate of [Fast integer to decimal conversion](http://stackoverflow.com/questions/10488468/fast-integer-to-decimal-conversion) — Ben Voigt, Feb 01 '14 at 20:14
@BenVoigt, I compiled up the code cited directly above. It runs in ~ the same time as MSVC's _itoa(), which is to say, ~ 3X slower than my code above, so I don't think it's a duplicate. — , Feb 02 '14 at 20:55
@BenVoigt, I had some time last night to play with some of the code in your coding challenge. Pretty clever solving the reverse order of characters by indexing into an array where they're already reversed. I have an idea I think can improve on that performance using the same general approach, but so far, it's just an idea. Would you be willing to declare a new winner if I can beat the best in your challenge? — , Feb 02 '14 at 20:57
@RocketRoy: While you won't be the accepted answer according to the rules that set a particular threshold to beat, I certainly would update the note at the bottom of my question that reports which is fastest. Go ahead and add another answer there (put at least the code in your answer, let's learn from the dissappearing ideone pages) and if it benchmarks faster, that note will direct people to prefer your version. — Ben Voigt, Feb 02 '14 at 21:02
@RocketRoy: BTW your code might not be a duplicate of any answer to the other C question, but the topic is a duplicate. Your code would be a good answer to that question also (but I'm convinced you'll go yet faster if you mix in some ideas from the C++ challenge) — Ben Voigt, Feb 02 '14 at 21:04
@BenVoigt. I made simple but serious error, I left the atoi() call in the benchmark I used to test the correctness of the code. _i32toa() by itself runs in 28 clocks on my Hazwell box, so around 30 clocks on my previous benchmark machine, which was a Sandy Bridge i7. I think my code is outperforming yours? — , Nov 12 '14 at 00:39
@RocketRoy: I don't know what you're using for the benchmark measurements. Best way to figure out if you're beating the best solutions on my question is to run yours and theirs on the same computer with the exact same data. — Ben Voigt, Nov 12 '14 at 01:03
@BenVoigt, I'm using the same code as always, but sans the atoi() used to verify correctness, running a loop from LONG_MIN to LONG_MAX and converting the i to a string. That's the only valid benchmark as it covers the entire range, and the timings are for the average signed int. I can't figure out what code ended up the winner on your challenge. The results are rather muddled at this point. Perhaps you could send me a remark attached to the winning answer? — , Nov 12 '14 at 01:22
@RocketRoy: Timo answered while the question was closed, so he edited his code into my existing answer here: http://stackoverflow.com/a/4351465/103167 To make it work in C, replace the final line with the final memcpy you are using. BTW your CPU is about six generations newer than his was; getting 750 MB/s from his code would not surprise me. — Ben Voigt, Nov 12 '14 at 01:28
why do you use `(int)-1 * (int)i` when only `-i` is enough? You don't need to calculate the remainder by `ut - (ui * 10)` because most architectures provide the remainder along with the division result. Even if that architecture's div instruction doesn't return the remainder, the compiler can still be smart enough to optimize it accordingly. — phuclv, Nov 12 '14 at 04:09
@LưuVĩnhPhúc, so how would you make this remainder materialize in a way that's useful? — , Nov 12 '14 at 06:14
@LưuVĩnhPhúc, and slower. I assume you didn't bother to actually cut, paste and run the code? That was too much trouble? An undue burden? I so appreciate these "drive by" comments. Go away. — , Nov 12 '14 at 07:09
The compiler is smarter than many people to know that it can transform a division by constant to a multiplication. Making the code readable will provide the compiler different ways to optimize the code. Wise people will measure the performance before... — phuclv, Nov 12 '14 at 09:16
It has probably been lost in all the muddle, but this code is faster than anything in Ben Voigt's challenge and takes up a lot fewer resources to boot. — , Sep 01 '15 at 09:32
Your version is indeed faster than that lone surviving code in Ben's "question." But if we simply change the last line of that function and, instead of creating a std::string (really Ben???) we perform a memcpy to an user allocated buffer, just like you do, that function **is 30% faster than yours**. And that is on a machine running with Spectre and Meltdown protections! So not only you lost, but you lost big. By the way, even though I agree with the idea expressed in a couple of your replies, you really should work on your social skills. — user1593842, May 04 '20 at 22:26
Claims in these comments about `ui % 10` being slower than `ut - (ui * 10)` should be presumed to be a failure to turn up compiler optimizations until proven otherwise. Far too many people test performance on default compiler settings, not realizing that most compilers will generate profoundly inefficient machine code by default (to make debugging easier). In other words, @phuclv was probably right all along, and is even more probably right by the time you are reading this since compiler optimizations keep improving. — mtraceur, Jan 16 '23 at 04:49
`(uint32_t)((int)-1 * (int)i)` will be undefined behavior on all modern C implementations for all modern hardware (2's complement integer representation being the main relevant attribute) if `i` is the minimum value that `int32_t` can hold. It is also needlessly redundant with its casts (`-1` is inherently an `int`, for example). Replacing that whole right side of the `=` sign with `i * (uint32_t)-1` would be perfectly well-defined to do the right thing on all standard-conforming implementations of C which have `uint32_t` and `int32_t`. — mtraceur, Jan 16 '23 at 05:16

Thomas Matthews · Answer 1 · 2014-02-01T19:34:33.323

3

You can eliminate the memcpy by writing directly into the caller's memory area.
You should have the caller pass the size of the buffer.

The other bottleneck is division, but I don't see how to get around that.

Edit 1: correct initialization of buffer pointer

char * _i32toa(char *const rtn, unsigned int buff_size, int32_t i)  
{
    if (NULL == rtn) return NULL;

    uint32_t  ut, ui;
    char minus_sign=0;
    char *p = rtn + buff_size - 1;
    // As before, without memcpy.
    return rtn;
}

edited Feb 01 '14 at 19:34

answered Feb 01 '14 at 19:10

Thomas Matthews

56,849
17
98
154

1

The auto buff[] and memcpy() allowed me to write to the back of the buff knowing in almost all cases p != buff[] at the end of the function. I adjust for the partially used buffer having a gap between the 1st byte of the string and the front of the buffer by memcpy()-ing the auto buff[] into the correct position in the caller's string buffer. It's either that or you have to know how many bytes you'll need before you start writing the string. – Feb 01 '14 at 19:26
If the caller passes the size of the buffer, you know where the back is and you start from there. – Thomas Matthews Feb 01 '14 at 19:30
1

That tells you where the back of his potentially very long buffer is, but it doesn't tell you where to start writing chars in that buffer in reverse order so you will end up writing the most significant digit at the head of his buffer. – Feb 01 '14 at 19:34
See my Edit 1 where I show the initialization of `p`. – Thomas Matthews Feb 01 '14 at 19:34
So, you can hard code the buffer size or change the definition to the ending index in the buffer of the result. – Thomas Matthews Feb 01 '14 at 19:36
3

That would only work if buff_size were the eventual strlen(buff[]). – Feb 01 '14 at 19:37
Thomas. If the caller is using a 16 byte buffer your buff_size = 16. If I am converting 13 the last 3 bytes of your 16 byte buffer are ASCII 1,3,'\0' which leaves the calling function exactly 29 bytes of guaranteed garbage at *const rtn. No C programmer on Earth would accept this as a usable function. – Nov 04 '16 at 10:50
@RocketRoy: Unless the application is timing critical, 29 iterations is insignificant. More time is wasted waiting for User's response or I/O to finish. – Thomas Matthews Nov 04 '16 at 15:22
You must be high. Your buff_size would need to be strlen(), using more processor power than my memcpy(). Of course you can't call strlen() because you don't have the string until this function creates it. If your buff_size is 32 and the string is "123" your buffer has 29 bytes of garbage (which could, and probably would contain ASCII digits and 1-n null terminators) 3 useful chars and an ASCII zero - unlike mine where the 1st 4 bytes contain exactly what is expected, the useful string. Seriously, you're embarrassing yourself. Read the standard C library and study how functions are written. – Nov 08 '16 at 01:25

score 1 · Answer 2 · answered Feb 01 '14 at 19:34

1

Get rid of the auto char array and make them pass the size so you can check for overflow.

#define I32TOA( buff, val ) _i32toa( (buff), sizeof(buff), (val) )

char * _i32toa(char *const rtn, size_t size, int32_t i)    {
    if (NULL == rtn) return NULL;

    uint32_t  ut, ui;
    char minus_sign=0;
    char *p = rtn + size-1;
    *p-- = 0;    // nul-terminate buffer
    assert( p >= rtn );

    if (i < 0)    {
        minus_sign = '-';
        ui = (uint32_t)((int)-1 * (int)i);
    }    else    {
        ui = i;
    }

    while (ui > 9) {
        ut = ui;
        ui /= 10;
        *p-- = (ut - (ui * 10)) + 48;
        assert( p >= rtn );
    }
    *p = ui + 48;

    if ('-' == minus_sign) {
        *--p = minus_sign;
        assert( p >= rtn );
    }

    return p;
}

answered Feb 01 '14 at 19:34

brian beuning

2,836
18
22

1

Assume he allocated a buffer in the calling function, and a pointer to it, like char foo[], *p=foo; if he passes p as the arg, the pointer he gets back is no longer equal to p, or &foo[0]; This approach would create the same problem that forces you to always update any pointer passed to realloc(), IE: the pointer value has changed. – Feb 01 '14 at 19:44
Then why bother returning a pointer at all? – brian beuning Feb 01 '14 at 21:30
2

The returned pointer tells the user where the string you built is located in his buffer. BTW, your macro to check the size of the buffer is getting the size of a pointer, not the size of the buffer, except in the special cases where the buffer is a buffer of fixed size declared in the same scope as the calling function. I changed my code to call malloc() to help illustrate this point. – Feb 02 '14 at 20:48

Optimal Base-10 only itoa() function?

2 Answers2

Linked