Large integer radix/base conversion from 10^x to 2^x

Question

Preface

I'm learning about computer math by writing and refining my own BigInt library. So far my first incarnation stores every digit of a base 10 number in successive elements of a vector. It can multiply and add with arbitrary precision. I want to speed it up by using all of the space available to me in a standard C++ data type by converting to base 2^x.

Information

I am reading 1000 or more digits from stdin in base 10 and I wish to convert them to base 2^x so I may store them easily in an array or vector of one of the standard C++ data types, likely unsigned int. I have only one idea for how to do base conversion, the repeated division with remainder method. Here is some C++ code describing that method:

vector<int> digits;
while(num!=0) {
   int mod = num%base;
   num = num/base;
   digits.push_back(mod);
}

Conundrums

Some things I'm lost on are whether or not division with remainder is the proper way to do radix conversion on large integers. I've tried looking at how the GMP library does it. gmp/mpn/generic/set_str.c is the relevant c source file where "the magic" happens, but I am not certain what is going on in there. Matt McCutchen's BigInt appears to use the repeated division with remainder method. If I do use this method, I essentially need write two versions of my BigInt class, one for working in Base10 and the other for Base2^x.

Conclusion

Offer advice on the proper steps for converting a huge number from a string to an array of 32 bit words.
Help me learn how GMP converts a string to an array of 32bit words without wading through many layers of abstraction.

Examples Using 4 bit word size

Number we want to store (obviously on the small size): 123456789

unsigned chars have a range of 0-255, if we want to split up our number and store it in the vector we can do it one of three ways:

As base 10, our vector looks like: [1,2,3,4,5,6,7,8,9]
- This is what my vector looked like in my first implementation.
As base 100, our vector looks like: [1,23,45,67,89]
- Easy to convert from base 10 to base 100, has ciel(digits in base10/2) elements.
As base 256, our vector looks like: [7,91,205,21]

Obviously the third solution is the most optimal for internal representation, and is what I am trying to get to.

Is there any chance that you're confusing the string-to-bigint conversion with the bigint-to-string conversion? — Mark Ransom, Jun 06 '11 at 22:54
@MarkRansom - No, I'm trying to read in a string and store it as a bigint. — frontendloader, Jun 06 '11 at 23:01
A bigint class is probably better off storing larger chunks, not smaller. Instead of going from 10 down to 2, you're probably better off going up to 2^31-1 or something. — user1277476, Aug 16 '12 at 21:30
the book C Interfaces and Implementations http://www.amazon.com/Interfaces-Implementations-Techniques-Creating-Reusable/dp/0201498413 has a complete worked bigint library, explained in detail. it's not a *great* book (it's ok, just not exceptional), but it sounds like it would be perfect for you here. — andrew cooke, Aug 16 '12 at 22:09

Mark Ransom · Accepted Answer · 2011-06-07T02:16:48.780

6

Once you have multiply and add functions working for your bigint library, converting a string to bigint is simplicity itself. Start with a conversion result of zero. For each digit you process (left to right), multiply the previous result by 10 and add the value of the new digit (using your bigint multiply and add functions).

edited Jun 07 '11 at 02:16

answered Jun 06 '11 at 22:47

Mark Ransom

299,747
42
398
622

1

Although it should be noted that converting the BigInt to an integer is pretty much the most complex function you have to implement for a BigInt library ;) – Voo Jun 07 '11 at 00:23
This answer missed the point of the question. If I read in each digit of 1234 and store it in a vector it'd look something like [1,2,3,4]. I want to be able use the full range of those individual char elements. 1234 can be represented as two char elements [4,210]. 4*(256^1) + 210*(256^0) = 1234 – frontendloader Jun 07 '11 at 01:30
1

@Dezzimal: For each digit you read, multiply x by 10, then add the digit to x (x is your base 256 variable). For "1000", x progresses like this: [0], [1], [10], [100], [3,232]. – Tom Sirgedas Jun 07 '11 at 01:38
I also managed to miss the simplicity of this answer, @Tom enlightened me. – frontendloader Jun 07 '11 at 02:57
Highschool maths, folks =) Remember the 1s, 10s and 100s columns from primary school =) Excellent and clear answer, though. – ioquatix Dec 15 '11 at 23:41
If anyone wants to see the start of a C++ bigint class in base 1000000000, see http://stackoverflow.com/a/22004815/5987 – Mark Ransom Apr 01 '14 at 15:37

score 1 · Answer 2 · answered Jun 06 '11 at 22:51

In general, to convert one base to another (from most significant digit to least), the algorithm is as follows:

output = 0
foreach digit in digits:
    output = output * base + digit

In the reverse order, it's the following:

output = 0
multiplier = 1
foreach digit in digits:
    output = output + multiplier * digit
    multiplier = multiplier * base

You can use your bigint library recursively using this math to figure out how to store the numbers. By that, I mean you need to implement BigInt*BigInt and BigInt+BigInt, so this is how you can convert bases. It's not the most efficient way, but it is a lot faster than division.

score 0 · Answer 3 · answered Aug 16 '12 at 21:22

One thing not mentioned in FryGuy's post is that in that method the arithmetic has to be carried out in the base you are converting to, while the base that is being used as a multiplier is the same as the base of your original number; the base you no longer want or the base you are converting from.

There are two formulas for radix conversion on integers. (1) uses base B, which we are converting to (destination base), as a repeated divisor coupled with arithmetic done in base b, where b is the base converting from. (2) on the other hand uses base b as a multiplier on the digits we are converting, which happen to be in the same base b, and does arithmetic in base B.

(1) formula uses B as a parameter and does arithmetic in base b

(2) formula uses b as parameter and does arithmetic in base B

this is out of knuth, vol 2, 4.4 radix conversions

**some further info:** arithmetic modulo some base _b_ can be carried out with well known algorithms mentioned in the same book — paul245, Aug 17 '12 at 16:22