String encoding issue?

Question

I have developed a convertBase function that is able to convert a value into different bases and back.

string convertBase(string value, int fBase, int tBase) {
    string  charset =  "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/",
            fromRange = charset.substr(0, fBase),
            toRange = charset.substr(0, tBase),
            cc(value.rbegin(), value.rend()),
            res = "";
    unsigned long   int dec = 0;
                    int index = 0;

    for(char& digit : cc) {
        if (charset.find(digit) == std::string::npos) return "";
        dec += fromRange.find(digit) * pow(fBase, index);
        index++;
    }
    while (dec > 0) {
        res = toRange[dec % tBase] + res;
        dec = (dec - (dec % tBase)) / tBase;
    }; return res;
}

The code is working while encoding simple string like "Test" and back again but gets it problems with encoding long strings like "Test1234567" because it gets encoded as "33333333333333333333333333333333" and that seems to be absolutely wrong!

Why is this happening and how to fix this issue?

First of all, if you're programming in C++ then please don't add any irrelevant language tags. Secondly, this is the perfect time to [learn how to debug your programs](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/). — Some programmer dude, Mar 30 '18 at 19:23
What values have `fBase`, `tBase`? Are `unsigned long` big enough for your calculations? (I expect: no). Which overload of `pow` are you using - no declaration returns integer types at http://www.cplusplus.com/reference/cmath/pow/ — milbrandt, Mar 30 '18 at 19:39
Ok sound legit, but what would be a alternative way instead of "unsigned long"? @milbrandt — , Mar 30 '18 at 19:40
You might use some big int library like referenced in https://stackoverflow.com/questions/1055661/bigint-bigbit-library. Would your algorithm work with `long double` precisssion? — milbrandt, Mar 30 '18 at 19:44
Modulo is not an issue. `a % b = a - floor(a/b)*b` is valid also for decimal types. — milbrandt, Mar 30 '18 at 19:49
You're right but I've tried to use `long double` but it doesn't seems working either. (decoded looks closer to value but still not the same) @milbrandt — , Mar 30 '18 at 20:15
I just copied your code in Visual Studio and got already the compiler *warning C4244: '+=': conversion from 'double' to 'unsigned long', possible loss of data*. And conversion convertBase("Test",3, 4) was not possible to recover the string. Either another algorithm or we'll have to have a closer look on your idea. — milbrandt, Mar 30 '18 at 20:21
`std::fmod` can also be used for modulo, there is already an overload for `long double` — milbrandt, Mar 30 '18 at 20:26

G. Sliepen · Answer 1 · 2018-03-31T09:36:21.613

1

A long int is typically 32 or 64 bits in size, depending on which CPU architecture you are on, but it can even have other sizes. You are adding bigger and bigger numbers to dec. At some point, the numbers become larger than a long int can hold, and then your program breaks down.

If you need to handle arbitrarily large inputs, you need to use a different approach. If you can, use a "bignum" or "bigint" library like GMP.

edited Mar 31 '18 at 09:36

answered Mar 30 '18 at 19:36

G. Sliepen

7,637
1
15
31

Thanks for you're quick answer. But I can not really get your thoughts tbh. What do you mean with "You have to find some way to add characters to res while you are processing cc"? Would you mind to share some code? :) – Mar 30 '18 at 19:42
I do mind, since it will be much more instructive if you try to find the solution yourself using the hints I gave you, than me just handing you the solution :) Anyway, you have already written most of the code. It's just that you wait until processing all input before starting to write the output. Instead, whenever you have processed one input character, try to see if `dec` is big enough that you can write some output, and then write that output. It should be similar to moving your whole `while`-loop inside the `for`-loop, but changing the condition somewhat, and some extra bookkeeping. – G. Sliepen Mar 30 '18 at 20:04
First of all, I would like to thank you extremely for the helpful tips & comments, but I have to be honest: I am sure your answer is completely correct & correct in content but I must honestly say that this answer does not bring me so much. I'm more of the "try" type than the "theory" type. And since I've been sitting at this en / decode function continuously for 2 days now and have just turned to SO in despair today & thus also to you & still do not really know how to finally solve the problem I'm really close to giving up. *"try to see if dec is big ... that output"*. No idea how to do this! – Mar 30 '18 at 20:11
1

You're missing the `declaration of charset, ...` and the `function call`! ^@G.Sliepen So you're code looks let's call it - destroyed :) **Edit:** And you did one mistake: You have to use `value` instead of the reversed value `cc` for the iteration. Please edit. – Mar 30 '18 at 20:39
Ok it's too late, I'll try to post better code tomorrow. But I think using the reversed value is still correct. – G. Sliepen Mar 30 '18 at 20:49
Hey thanks so much for you're help :) See you tomorrow, thanks in advance. – Mar 30 '18 at 21:07
Ok, what was I thinking yesterday? Of course my suggested approach won't work at all, except when `fBase % tBase == 0`. So either you have to limit yourself to inputs that are small enough to fit in the largest integer type supported by your compiler (typically `uint64_t` or if you are lucky, `__uint128_t`), or you need to implement your own bignum routines for multiplying and adding arbitrarily large numbers, or just use an existing library as @milbrandt suggested. – G. Sliepen Mar 31 '18 at 09:32
:( However -thanks anyway, even if I got no working solution. Mhm so it looks like I have to give up the custom Base en/decdoing. So bad :/ @G.Sliepen – Mar 31 '18 at 16:33

score 0 · Answer 2 · edited Mar 30 '18 at 20:58

When starting debugging, the big issues become obvious. Let's start with a call of convertBase("Test",3, 4).

string  charset =  "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/",
        fromRange = charset.substr(0, fBase),
        toRange = charset.substr(0, tBase),

The substr copies from the beginning (0) xBase characters. This will result in the example case fromRange = "012" and toRange="0123".

for(char& digit : cc) {
    if (charset.find(digit) == std::string::npos) return "";
    dec += fromRange.find(digit) * pow(fBase, index);

The test of digit in charset will be succesful, so no return will be executed.

But already in the first Iteration, when digit=='t', in fromRange.find(digit) it is not found and std::string::npos is multiplied with something. But not only t will be mapped to this value, any character in charset but not in fromRange will get the same value. This is not invertible, no bijection!

...... This leads to the conclusion that the algorithm will not work independent of any integer value limits.

In your question and your code there is no restriction to the arguments `fbase`, `tbase`. Please provide then your call of your method or tell at least any restrictions. I dislike guessing what you implicitly assume. — milbrandt, Mar 31 '18 at 07:08

String encoding issue?

2 Answers2