2

I created an assembly program that implements biginteger to calculate big fibonacci numbers. The program works and using gdb I can verify that the numbers are correctly stored in memory.

The problem comes when I want to print the numbers. Normally I would just use printf but AFAIK it doesn't support numbers bigger than 64-bit. This means that I need to implement the conversion myself. I know that I need to calculate remainders mod 10 but I don't know how to do that for numbers so big.


Code

section .text
global _start

_start: mov eax, 192
        xor ebx, ebx
        mov ecx, 2048
        mov edx, 0x6
        mov esi, 0x22
        mov edi, -1
        xor ebp, ebp
        int 0x80 ; allocate memory
        lea r8, [eax] ; a
        lea r9, [eax+1024] ; b
        mov qword [r8], 0x1
        mov qword [r9], 0x1
        mov rbx, 0x2 ; fib num counter
        jmp fib

fib:    inc rbx ; num counter++
        mov rsi, 0x0 ; addr for int
        mov rdx, 128; counter for int
        clc ; clear carry flag
add:    mov rax, [r8+8*rsi]
        adc rax, [r9+8*rsi] ; rax = a + b
        mov rcx, [r8+8*rsi]
        mov [r9+8*rsi], rcx ; b = a
        mov [r8+8*rsi], rax ; a = a + b
        inc rsi
        dec rdx
        jnz add ; repeat for whole bigint
        call print
        cmp rbx, 100 ; repeat for 100 fib nums
        jl fib
        jmp exit

print:  ; what to do here?
        ret    

exit:   mov eax, 0x1
        xor ebx, ebx
        int 0x80
apilat
  • 1,370
  • 8
  • 17
  • Use packed BCD instead of binary, then conversion and calculations are much simpler. – Shift_Left Mar 11 '18 at 22:29
  • 1
    Extracting digits is elementary school mathematics, where exactly did you get stuck? E.g. start from 1, multiply by 10 in every step until you get a power that is bigger than your number then go and subtract each to get the appropriate digit. That is unless you want to [implement division](https://en.wikipedia.org/wiki/Division_algorithm). – Jester Mar 11 '18 at 22:40
  • 1
    Using [`int 0x80` in 64-bit code is highly not recommended](https://stackoverflow.com/questions/46087730/what-happens-if-you-use-the-32-bit-int-0x80-linux-abi-in-64-bit-code/46087731#46087731), especially for system calls that take pointers. – Peter Cordes Mar 12 '18 at 04:12
  • 1
    You don't need to exchange the data between buffers every time. Instead, swap pointers outside the inner loop. (Also, `mov rcx, [r8+8*rsi]` is silly. You already loaded that into `rax`, but then clobbered it with `adc`. You should have simply done `mov rcx, rax` before the `adc`). And BTW, if you use `add rbx, 1`, that will clear CF so you can drop the `clc` instruction. – Peter Cordes Mar 12 '18 at 04:16
  • 2
    I wrote an extended-precision Fibonacci to print the first 1000 digits of Fib(10^9), heavily optimized for code-size [105 bytes of x86 32-bit machine code](https://codegolf.stackexchange.com/questions/133618/extreme-fibonacci/135618#135618). It runs in about a minute on a 4GHz Skylake. To simplify division by powers of 10 (for printing, and for keeping only the leading 1009 digits), **I stored data in 32-bit chunks, in base 10^9 rather than base 2^32**. That meant doing the carry-out manually with a compare after `add`, but it made division much cheaper and printing simple. – Peter Cordes Mar 12 '18 at 04:21
  • For higher performance, 64-bit chunks of base 10^19 would work in 64-bit code. I used 32-bit because I was optimizing for code-size as well as performance, with much more emphasis on code-size. (Although this code-golf question also had an interesting performance requirement, which was fun.) See my code-golf answer for lots more discussion about it, including working printing code for numbers in that format. – Peter Cordes Mar 12 '18 at 04:22
  • @Jester the problem with that approach is that the powers of 10 quickly overflow and my numbers are way bigger than 64-bit – apilat Mar 12 '18 at 14:50
  • 2
    They don't overflow, because you use the same arbitrary precision arithmetic. – Jester Mar 12 '18 at 16:02
  • I can't seem to get your method. Could you provide code which shows how it works? – apilat Mar 12 '18 at 16:24
  • 1
    Using your own large number code, calculate powers of 10, then subtract repeatedly. – Jester Mar 12 '18 at 16:54
  • 1
    @apilat: Keep multiplying by 10 (extended precision) and comparing, until you get a number larger than your input. The power of 10 *before* that is the one you want. Then subtract it repeatedly from your input, and count how many times until the value becomes negative. This count is your leading digit. Keep the value from before it became negative, and repeat until you get a value that fits in 64 bits. (Use `printf` or other efficient algorithm on that.) This avoids any extended-precision division by 10, but does lots of mul and sub. – Peter Cordes Mar 15 '18 at 06:31

1 Answers1

3

There's an interesting classical algorithm called double-dabble (easy to find many references) that uses only shift and add to convert binary unsigned integers to binary-coded decimal (BCD). BCD is very simple to print as ASCII decimal.

For arbitrary-length integers double-dabble is not particularly efficient, but it's simple to implement.

Here's C code for multi-word double-dabble. It assumes unsigned ints are 32 bits and works fine on my machine where that's true. One could make it more portable, but I'm not bothering because you intend to translate to assembly anyway.

The multi-word shift and add can be much smaller, simpler, and faster in assembly language because you have direct access to carry bits.

#include <stdio.h>

typedef unsigned int word;

// Print BCD in x of given size.
void print(word *x, int size) {
  for (int i = size - 1; i >= 0; --i)
    for (int j = 28; j >= 0; j -= 4) {
      unsigned d = (x[i] >> j) & 0xf;
      putchar('0' + d);
    }
  putchar('\n');
}

// Shift x of given size in words left one bit
void shift_left(word *x, int size) {
  for (int i = size - 1; i > 0; --i) x[i] = (x[i] << 1) | (x[i - 1] >> 31);
  x[0] <<= 1;
}

// Add small constant c to x of given size in words.
void add(word *x, int size, word c) {
  word x0 = x[0] + c;
  word carry = x0 < x[0];
  x[0] = x0;
  for (int i = 1; carry && i < size; ++i) {
    word xi = x[i] + carry;
    carry = xi < x[i];
    xi = xi;
  } 
}

// Do the double dabble increment.
void double_dable_increment(word *bcd, int size) {
  for (int i = 0; i < size; ++i) {
    for (int j = 28; j >= 0; j -= 4) {
      word digit = (bcd[i] >> j) & 0xfu;
      if (digit >= 5) add(bcd + i, size - i, 3u << j);
    }
  }
}

// Convert binary in the low words of x into BCD in the high words.
void convert(word *x, int bin_size, int bcd_size) {
  for (int i = 0; i < 32 * bin_size; ++i) {
    double_dable_increment(x + bin_size, bcd_size);
    shift_left(x, bin_size + bcd_size);
  }
}

// Warning: Not portable. For simplicity, assumes 32-bit words.
#define BCD_BITS_PER_BINARY_BIT (1.20411998267)
#define WORDS_FROM_BITS(N) (((int)(N) + 31) / 32)
#define BCD_WORDS_FROM_BINARY_WORDS(N) \
  WORDS_FROM_BITS(BCD_BITS_PER_BINARY_BIT * 32 * (N))

// Test problem uses 4 words.
#define SIZE 4
#define BCD_SIZE BCD_WORDS_FROM_BINARY_WORDS(SIZE)

int main(void) {
  // Binary rep of 12345678901234567890123456789
  word x[10] = { 0xae398115, 0xaadfa328u, 0x6015fec5u, 0x5ce0e9a5u, };
  convert(x, SIZE, BCD_SIZE);
  print(x + SIZE, BCD_SIZE);
  return 0;
}

And if we run it,

$ ./a.out
0123456789012345678901234567890123456789
Gene
  • 46,253
  • 4
  • 58
  • 96
  • 1
    https://en.wikipedia.org/wiki/Double_dabble has an example with an 8-bit integer. And BTW, if the desired output format is ASCII not BCD, you can probably dabble into 4 bits per byte, propagating carry manually between digits instead of having hardware do it for you within bytes / words. Wikipedia has a C implementation that (I think) works this way. – Peter Cordes Mar 15 '18 at 06:39
  • @PeterCordes Sure. Thanks. But he's calculating consecutive Fibonacci results as an array of words and wants a routine to print those. – Gene Mar 15 '18 at 13:37
  • 1
    Which part are you saying "but" about? If the example, seeing how it works a step at a time for 8 bits is very useful for showing how it works. If the extended-precision array of words: producing an array of bytes vs. an array of nibbles doesn't change how you handle an input array of qword elements for extended-precision bit-shifts. Unpacking BCD to decimal is pretty cheap with SSE2, but if it saves some masking for the compares (or allows `pcmpgtb` to find elements that need an add) then 4 bits per byte instead of packed BCD can make extended-precision double dabble cheaper. – Peter Cordes Mar 15 '18 at 13:47