Fast cycling through decimal digits for printing (embedded)

Question

In my FW for real-time embedded processor I need formatted print of decimal number. Standard printf/sprintf are not available in toolchain, so I need implement it myself.

I used naive approach of dividing by tens and taking remainder. But my target processor doesn't support division natively and software implementation take very long time (more than 200us) to compute. I wonder if there is quick way to fetch decimal digits from a number without division?

char* os_prn_decimal(char* outBuf, const char* end, uint32 v)
{
    uint32 dgtIdx = 1000000000;

    do
    {
        uint8 dgt = (uint8)(v / dgtIdx);

        *outBuf = dgt + '0';
        ++outBuf;

       v = v % dgtIdx;
        dgtIdx /= 10;
    } while (outBuf < end && dgtIdx > 0);
    return outBuf;
}

The short answer is **NO**. You are converting from one base to another, so integer division and modulo is what you need. But perhaps there's other optimization you can do. For example, see if you can figure out how to do this without the `(uint8)` casting. You could also do an initial check of the base-10 magnitude of the number to see if it exceeds your buffer size rather than checking `end` in every loop. Besides getting (slightly) better speed (for large values of `v`), this would solve the issue of outputting incomplete data when the buffer is too small. — daShier, Sep 08 '19 at 13:46
Incomplete data is my least concern tbh :) 200us is equivalent to 4K operations. 10 checks for `< end` in not a big deal. But for base conversion itself, maybe there are some math tricks for division by tens. Worth checking with smart people on StackOverflow anyway. Thanks for reply — Noob, Sep 08 '19 at 15:20
Well this might help: https://stackoverflow.com/questions/5558492/divide-by-10-using-bit-shifts — daShier, Sep 08 '19 at 15:33
Perhaps if you mention what your actual target is, you would get good answers. Manual optimization without a specific target in mind doesn't make much sense most of the time. — Lundin, Sep 08 '19 at 20:58
@Lundin I stated very clear what is relevant limitation of my processor - lack of division. Providing "actual target" will not give any more information since this is proprietary DSP or my company. — Noob, Sep 10 '19 at 10:59
Knowing its word size is kind of fundamental... you can't really write any C code if you don't know that. — Lundin, Sep 10 '19 at 11:08

Clifford · Answer 1 · 2019-09-08T18:12:43.657

Your solution generates the digits in the right order directly, but at a cost of a variable divide (v / dgtIdx), a variable modulo (which has the same or greater cost than a divide), and a divide by 10. That is three expensive operations.

It may be less expensive to generate the digits from the least-significant digit first, then reverse the digits after. That will then require only a divide by 10 and a modulo 10 operation. Using the solution at Divide by 10 using bit shifts? and modifying it to obtain the remainder in the same operation as the quotient:

uint32_t div10_rem( uint32_t dividend, int* remainder )
{
    uint32_t quotient = (uint32_t)((0x1999999Aull * dividend) >> 32) ;
    *remainder = dividend - (quotient * 10) ;

    return quotient ;
}

Then the conversion to a displayable decimal string might be:

char* int2dec( uint32_t val, char* buffer )
{
    char reverse_digits[10] = {0} ;
    uint32_t u = val ;
    size_t digit_count = 0 ;

    while( u > 0 )
    {
        int d = 0 ;
        u = div10_rem( u, &d ) ;
        reverse_digits[digit_count] = d + '0' ;
        digit_count++ ;
    }

    buffer[digit_count] = '\0' ;
    size_t i = 0 ;
    for( size_t i = 0; i < digit_count; i++ )
    {
        buffer[i] = reverse_digits[digit_count - i - 1] ;
    }

    return buffer ;
}

Then an example usage:

    char buffer[11] ;
    printf( "%s", int2dec( val, buffer) ) ;

It is possible to avoid the digit reversal if a static buffer is suitable:

#define MAX_DIGITS 10
const char* int2dec( uint32_t val )
{
    static char digits[MAX_DIGITS + 1] = {0} ;
    uint32_t u = val ;
    size_t digit_index = MAX_DIGITS - 1 ;

    while( u > 0 )
    {
        int d = 0 ;
        u = div10_rem( u, &d ) ;
        digits[digit_index] = d + '0' ;
        digit_index-- ;
    }

    return &digits[digit_index + 1] ;
}

Then, for example:

    printf( "%s", int2dec( val ) ) ;

Yes, that's the similar to what daShier referred in his comment https://stackoverflow.com/questions/5558492/divide-by-10-using-bit-shifts. It is all good, but it has 64-bit multiplication, which it is a bit problematic for my specific environment - such operation is not supported on the processor. Anyway, thanks for answer. — Noob, Sep 08 '19 at 17:24
@Noname : It is not 64 bit multiply, it is a 32x32bit multiply with 64 bit result. It does not have to be supported by the processor, just by the compiler. A compiler provided software 32x32 bit multiply will be faster than a software divide on any platform and will be supported by whatever hardware multiplication is available. — Clifford, Sep 08 '19 at 17:37
@Noname : It is exactly what daSher referred to, but he only suggested it might help - it was a long way from an answer. It your processor and toolchain apply such constraints, it would help if you were to specify what they were - (the processor and toolchain, not the constraints). — Clifford, Sep 08 '19 at 17:59
no need to argue about semantic. In your terms, I meant to say that my processor supports only 16x16 multiplication. — Noob, Sep 08 '19 at 18:07
@Noname : Yes, but you have missed my point; that is true of many processors - the compiler can still synthesise integer operations for word sizes greater than the hardware support, and that will still be faster than a divide. Your shift solution may be faster still, but does not present a complete solution to the question - you did not ask how to divide by 10 with quotient and remainder specifically - that is a means to a solution, not a solution. — Clifford, Sep 08 '19 at 18:16

Noob · Accepted Answer · 2019-09-10T19:30:28.223

1

A hint from daShier helped me to rectify my googling and I found this article https://forum.arduino.cc/index.php?topic=167414.0 that describes interesting approach to division by 10, which provides both quotient and modulo. Best part of it is complete lack of multiplications, divisions and cycles.

UPD: simulation measurement showed ~2X better performance of this solution compared to alternative solution and ~6X better performance over my original implementation.

void divmod10(uint32_t in, uint32_t &div, uint32_t &mod)
{
 // q = in * 0.8;
 uint32_t q = (in >> 1) + (in >> 2);
 q = q + (q >> 4);
 q = q + (q >> 8);
 q = q + (q >> 16);  // not needed for 16 bit version

 // q = q / 8;  ==> q =  in *0.1;
 q = q >> 3;

 // determine error
 uint32_t  r = in - ((q << 3) + (q << 1));   // r = in - q*10;
 div = q + (r > 9);
 if (r > 9) mod = r - 10;
 else mod = r;
}

edited Sep 10 '19 at 19:30

answered Sep 08 '19 at 17:20

Noob

335
1
8

I am unclear now whether it is the divide by 10 that was taking 200us or the whole algorithm? This is not a solution to the generation of a decimal string, just the divide by 10, which was only a single operation in your original algorithm. How have you used this to produce the string and what is the performance now (out of interest)? – Clifford Sep 08 '19 at 18:06
Preliminary estimation (based on disassembly) is ~14us per single print. Need run simulation to see actual performance. As for "this is not solution" - my question was "how fetch decimal digit without division". So as for me pretty much an answer :) – Noob Sep 08 '19 at 18:18
OK, but your question starts "I need formatted print of decimal number.", then gives a current implementation that is more that just fetching "a decimal digit without division". Good improvement though. And a nice answer - even if to a somewhat different question ;-) – Clifford Sep 08 '19 at 18:21
Coz StackOverflow requires provide some background for question. I'm just trying to be nice and follow guidelines :D – Noob Sep 08 '19 at 18:32
Ok, but it is not perhaps clear what is background and what is the actual question, and the meaning of the title eludes me entirely - but sees to alude to multiple digits suggesting the whole conversion algorithm. Moreover, if the aim is to optimise the complete conversion, asking about the complete conversion seems entirely reasonable - especially if that is what you are measuring to test the performance. If this is a complete and valid answer (and it is, because it is your question), then I would suggest that your question is not entirely clear. – Clifford Sep 08 '19 at 19:04
A question is usually a sentence ending with question mark. I thought it's quite obvious. – Noob Sep 08 '19 at 19:39
1

The compiler should be able to do these kind of optimizations for you if division is slow on your MCU. If this for some AVR then the problem is the presence of 32 bit arithmetic in the first place, and when optimizing you should start considering what you can do to get rid of that part. – Lundin Sep 08 '19 at 20:54
I am trying to get you to provide information and cast your question in a manner that will get you better answers and better contribute to the community resource. If I were searching for a means of doing fast divide by 10 quotient+remainder, I don't think this question would come up, and if it did I would not immediately see that it was relevant. SO serves two purposes 1) solve your problem, 2) provide a resource for solving problems. – Clifford Sep 09 '19 at 13:26
If the only person to have posted an acceptable answer is yourself, you have to ask "Was my question clear?" and look at it from the point of view of the reader. If there are comments from anyone with significant rep suggesting improvement, you might consider them rather then argue with those trying to assist you that the question is fine as it is! – Clifford Sep 09 '19 at 13:28
Clifford you posted same answer as daShier. I thanked you even though he was first. I found my solution thanks to his hint and I explained why I think it is better - there is no multiplication with 64-bit result, which is not supported on my processor. Don't understand why you continue this argument. It is not contributing to anything. – Noob Sep 10 '19 at 10:54
@Clifford so I've run simulation with both implementations. This is my test code: `log_trace("%u", 0xFFFFFFFF);` which I assumed worst case scenario with maximal number of digits. 669 cycles (~16us) with "shift" solution and 1173 cycles (~59us) with "multiply" solution. – Noob Sep 10 '19 at 19:19

Fast cycling through decimal digits for printing (embedded)

2 Answers2