5

I've got a byte array that was created using a hash function. I would like to convert this array into a string. So far so good, it will give me hexadecimal string.

Now I would like to use something different than hexadecimal characters, I would like to encode the byte array with these 36 characters: [a-z][0-9].

How would I go about?

Edit: the reason I would to do this, is because I would like to have a smaller string, than a hexadecimal string.

Kees C. Bakker
  • 32,294
  • 27
  • 115
  • 203

7 Answers7

6

I adapted my arbitrary-length base conversion function from this answer to C#:

static string BaseConvert(string number, int fromBase, int toBase)
{
    var digits = "0123456789abcdefghijklmnopqrstuvwxyz";
    var length = number.Length;
    var result = string.Empty;

    var nibbles = number.Select(c => digits.IndexOf(c)).ToList();
    int newlen;
    do {
        var value = 0;
        newlen = 0;

        for (var i = 0; i < length; ++i) {
            value = value * fromBase + nibbles[i];
            if (value >= toBase) {
                if (newlen == nibbles.Count) {
                    nibbles.Add(0);
                }
                nibbles[newlen++] = value / toBase;
                value %= toBase;
            }
            else if (newlen > 0) {
                if (newlen == nibbles.Count) {
                    nibbles.Add(0);
                }
                nibbles[newlen++] = 0;
            }
        }
        length = newlen;
        result = digits[value] + result; //
    }
    while (newlen != 0);

    return result;
}

As it's coming from PHP it might not be too idiomatic C#, there are also no parameter validity checks. However, you can feed it a hex-encoded string and it will work just fine with

var result = BaseConvert(hexEncoded, 16, 36);

It's not exactly what you asked for, but encoding the byte[] into hex is trivial.

See it in action.

Community
  • 1
  • 1
Jon
  • 428,835
  • 81
  • 738
  • 806
  • Why do I get the same answer if I change the digit string to `var digits = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";`? – Kees C. Bakker Sep 13 '11 at 08:54
  • @Kees: `digits` is just the digits available for use. To actually *make* use of the caps you need to pass an appropriately higher base as the second and/or third parameter. – Jon Sep 13 '11 at 08:58
  • I see... just made digits static and gave the length of it to the function. Perfect solution and very readable. – Kees C. Bakker Sep 13 '11 at 09:06
  • @Jon Why is there the `if (newlen == nibbles.Count) { nibbles.Add(0); }`? I fail to see why that would be required. – Paya Mar 17 '15 at 00:07
  • @Paya: if `newlen == nibbles.Count` and you don't add anything then the access `nibbles[newlen++]` immediately afterwards is going to throw. – Jon Apr 14 '15 at 19:55
  • @Jon yeah but is there any input that would actually turn that condition into "true"? When I analyzed the algorithm, I came to conclusion that the condition will never evaluate to *true*. – Paya Apr 14 '15 at 21:47
3

Earlier tonight I came across a codereview question revolving around the same algorithm being discussed here. See: https://codereview.stackexchange.com/questions/14084/base-36-encoding-of-a-byte-array/

I provided a improved implementation of one of its earlier answers (both use BigInteger). See: https://codereview.stackexchange.com/a/20014/20654. The solution takes a byte[] and returns a Base36 string. Both the original and mine include simple benchmark information.

For completeness, the following is the method to decode a byte[] from an string. I'll include the encode function from the link above as well. See the text after this code block for some simple benchmark info for decoding.

const int kByteBitCount= 8; // number of bits in a byte
// constants that we use in FromBase36String and ToBase36String
const string kBase36Digits= "0123456789abcdefghijklmnopqrstuvwxyz";
static readonly double kBase36CharsLengthDivisor= Math.Log(kBase36Digits.Length, 2);
static readonly BigInteger kBigInt36= new BigInteger(36);

// assumes the input 'chars' is in big-endian ordering, MSB->LSB
static byte[] FromBase36String(string chars)
{
    var bi= new BigInteger();
    for (int x= 0; x < chars.Length; x++)
    {
        int i= kBase36Digits.IndexOf(chars[x]);
        if (i < 0) return null; // invalid character
        bi *= kBigInt36;
        bi += i;
    }

    return bi.ToByteArray();
}

// characters returned are in big-endian ordering, MSB->LSB
static string ToBase36String(byte[] bytes)
{
    // Estimate the result's length so we don't waste time realloc'ing
    int result_length= (int)
        Math.Ceiling(bytes.Length * kByteBitCount / kBase36CharsLengthDivisor);
    // We use a List so we don't have to CopyTo a StringBuilder's characters
    // to a char[], only to then Array.Reverse it later
    var result= new System.Collections.Generic.List<char>(result_length);

    var dividend= new BigInteger(bytes);
    // IsZero's computation is less complex than evaluating "dividend > 0"
    // which invokes BigInteger.CompareTo(BigInteger)
    while (!dividend.IsZero)
    {
        BigInteger remainder;
        dividend= BigInteger.DivRem(dividend, kBigInt36, out remainder);
        int digit_index= Math.Abs((int)remainder);
        result.Add(kBase36Digits[digit_index]);
    }

    // orientate the characters in big-endian ordering
    result.Reverse();
    // ToArray will also trim the excess chars used in length prediction
    return new string(result.ToArray());
}

"A test 1234. Made slightly larger!" encodes to Base64 as "165kkoorqxin775ct82ist5ysteekll7kaqlcnnu6mfe7ag7e63b5"

To decode that Base36 string 1,000,000 times takes 12.6558909 seconds on my machine (I used the same build and machine conditions as provided in my answer on codereview)

You mentioned that you were dealing with a byte[] for the MD5 hash, rather than a hexadecimal string representation of it, so I think this solution provide the least overhead for you.

Community
  • 1
  • 1
kornman00
  • 808
  • 10
  • 27
  • Performance in `FromBase36String` can be improved by using a Dictionary for O(1) lookup instead of `.IndexOf( chars[x] )`, or better-yet: an array that maps from each base36 char-code back to the `base36Digits` array index. – Dai Aug 12 '16 at 09:50
  • 1
    I copied your code into a test project and it encodes `{ 203, 77, 29, 30, 215, 37, 184, 136 }` to `"1taukyt738g1h"` but then decodes it to `{ 53, 178, 226, 225, 40, 218, 71, 119 }`. – Dai Aug 12 '16 at 10:17
  • 1) Without actual perf stats to back the claim, I'm hesitant to accept that using a dictionary would be an improvement. It would require additional allocations, as what I have here requires no extra heap allocations and in general should have better cache locality. The array mapping would just require more bootstrapping. 2) Thanks, using http://rextester.com/ I'm seeing the same bad behavior you are. I'll have to debug it when I'm on a better dev machine and test to see if my Base-N encoder suffers the same problem http://stackoverflow.com/questions/14110010/base-n-encoding-of-a-byte-array – kornman00 Aug 14 '16 at 02:51
2

If you want a shorter string and can accept [a-zA-Z0-9] and + and / then look at Convert.ToBase64String

erikH
  • 2,286
  • 1
  • 17
  • 19
2

Using BigInteger (needs the System.Numerics reference)

Using BigInteger (needs the System.Numerics reference)

const string chars = "0123456789abcdefghijklmnopqrstuvwxyz";

// The result is padded with chars[0] to make the string length
// (int)Math.Ceiling(bytes.Length * 8 / Math.Log(chars.Length, 2))
// (so that for any value [0...0]-[255...255] of bytes the resulting
// string will have same length)
public static string ToBaseN(byte[] bytes, string chars, bool littleEndian = true, int len = -1)
{
    if (bytes.Length == 0 || len == 0)
    {
        return String.Empty;
    }

    // BigInteger saves in the last byte the sign. > 7F negative, 
    // <= 7F positive. 
    // If we have a "negative" number, we will prepend a 0 byte.
    byte[] bytes2;

    if (littleEndian)
    {
        if (bytes[bytes.Length - 1] <= 0x7F)
        {
            bytes2 = bytes;
        }
        else
        {
            // Note that Array.Resize doesn't modify the original array,
            // but creates a copy and sets the passed reference to the
            // new array
            bytes2 = bytes;
            Array.Resize(ref bytes2, bytes.Length + 1);
        }
    }
    else
    {
        bytes2 = new byte[bytes[0] > 0x7F ? bytes.Length + 1 : bytes.Length];

        // We copy and reverse the array
        for (int i = bytes.Length - 1, j = 0; i >= 0; i--, j++)
        {
            bytes2[j] = bytes[i];
        }
    }

    BigInteger bi = new BigInteger(bytes2);

    // A little optimization. We will do many divisions based on 
    // chars.Length .
    BigInteger length = chars.Length;

    // We pre-calc the length of the string. We know the bits of 
    // "information" of a byte are 8. Using Log2 we calc the bits of 
    // information of our new base. 
    if (len == -1)
    {
        len = (int)Math.Ceiling(bytes.Length * 8 / Math.Log(chars.Length, 2));
    }

    // We will build our string on a char[]
    var chs = new char[len];
    int chsIndex = 0;

    while (bi > 0)
    {
        BigInteger remainder;
        bi = BigInteger.DivRem(bi, length, out remainder);

        chs[littleEndian ? chsIndex : len - chsIndex - 1] = chars[(int)remainder];
        chsIndex++;

        if (chsIndex < 0)
        {
            if (bi > 0)
            {
                throw new OverflowException();
            }
        }
    }

    // We append the zeros that we skipped at the beginning
    if (littleEndian)
    {
        while (chsIndex < len)
        {
            chs[chsIndex] = chars[0];
            chsIndex++;
        }
    }
    else
    {
        while (chsIndex < len)
        {
            chs[len - chsIndex - 1] = chars[0];
            chsIndex++;
        }
    }

    return new string(chs);
}

public static byte[] FromBaseN(string str, string chars, bool littleEndian = true, int len = -1)
{
    if (str.Length == 0 || len == 0)
    {
        return new byte[0];
    }

    // This should be the maximum length of the byte[] array. It's 
    // the opposite of the one used in ToBaseN.
    // Note that it can be passed as a parameter
    if (len == -1)
    {
        len = (int)Math.Ceiling(str.Length * Math.Log(chars.Length, 2) / 8);
    }

    BigInteger bi = BigInteger.Zero;
    BigInteger length2 = chars.Length;
    BigInteger mult = BigInteger.One;

    for (int j = 0; j < str.Length; j++)
    {
        int ix = chars.IndexOf(littleEndian ? str[j] : str[str.Length - j - 1]);

        // We didn't find the character
        if (ix == -1)
        {
            throw new ArgumentOutOfRangeException();
        }

        bi += ix * mult;

        mult *= length2;
    }

    var bytes = bi.ToByteArray();

    int len2 = bytes.Length;

    // BigInteger adds a 0 byte for positive numbers that have the
    // last byte > 0x7F
    if (len2 >= 2 && bytes[len2 - 1] == 0)
    {
        len2--;
    }

    int len3 = Math.Min(len, len2);

    byte[] bytes2;

    if (littleEndian)
    {
        if (len == bytes.Length)
        {
            bytes2 = bytes;
        }
        else
        {
            bytes2 = new byte[len];
            Array.Copy(bytes, bytes2, len3);
        }
    }
    else
    {
        bytes2 = new byte[len];

        for (int i = 0; i < len3; i++)
        {
            bytes2[len - i - 1] = bytes[i];
        }
    }

    for (int i = len3; i < len2; i++)
    {
        if (bytes[i] != 0)
        {
            throw new OverflowException();
        }
    }

    return bytes2;
}

Be aware that they are REALLY slow! REALLY REALLY slow! (2 minutes for 100k). To speed them up you would probably need to rewrite the division/mod operation so that they work directly on a buffer, instead of each time recreating the scratch pads as it's done by BigInteger. And it would still be SLOW. The problem is that the time needed to encode the first byte is O(n) where n is the length of the byte array (this because all the array needs to be divided by 36). Unless you want to work with blocks of 5 bytes and lose some bits. Each symbol of Base36 carries around 5.169925001 bits. So 8 of these symbols would carry 41.35940001 bits. Very near 40 bytes.

Note that these methods can work both in little-endian mode and in big-endian mode. The endianness of the input and of the output is the same. Both methods accept a len parameter. You can use it to trim excess 0 (zeroes). Note that if you try to make an output too much small to contain the input, an OverflowException will be thrown.

xanatos
  • 109,618
  • 12
  • 197
  • 280
  • 1
    Are you sure this gives good results? It gives different results than my solution, which is [demonstratably reversible](http://www.ideone.com/ibTXw). I used the same "chars" string. – Jon Sep 13 '11 at 08:35
  • big integer should be created from reverced byte array `var bi = new BigInteger(bytes.Concat(new byte[] { 0 }).Reverse().ToArray());` – hazzik Sep 13 '11 at 08:39
  • @hazzik Done. And I've found another bug with prepended '\0' – xanatos Sep 13 '11 at 09:03
  • WOW finally an encoding using BigInteger that actually works. Everything else I tried produced unreliable results due to the sign byte. Thanks! – Shawn Jan 16 '15 at 01:50
  • Why are you reversing the array twice? Seems a bit redundant. – Paya Mar 16 '15 at 21:31
  • @Paya Because BigInteger is Little Endian, so one thousand is saved as 0001, while "humans" read numbers as Big Endian, so we want to read one thousand as 1000 – xanatos Mar 17 '15 at 06:09
  • @xanatos But that assumes that you - as a human - are going to read the byte array (and string) yourself, right? I mean, if you are just interested in encoding byte arrays into text and you don't care about reading it yourself personally (which honestly why would anyone do that?), then you can just skip some of the reversions, save yourself a bit of processing power and still decode it to the original byte array. I mean, the original question was about binary-to-text encoding, so the OP probably doesn't care about having 100 % mathematically correct radix conversion. – Paya Mar 17 '15 at 11:00
  • @Paya You are right, but I do think that it's easier to comprehend if the "direction" of the number remains the same. If you don't need it, then you can remove it. (but note that I do think the Jon solution was better, because he reimplemented the division function of BigInteger :-) (as I've said it was better to do *To speed them up you would probably need to rewrite the division/mod operation so that they work directly on a buffer, instead of each time recreating the scratch pads as it's done by BigInteger*) – xanatos Mar 17 '15 at 11:08
  • @xanatos Well, I benchmarked both solutions and yours is almost three times faster. :-) I was quite surprised by the results but I guess the `BigInteger` division is much more optimized than Jon's "elementary-school division". I even tried to optimize Jon's approach to use `StringBuilder` as well as to estimate the output length at the start, but that did not help much. – Paya Mar 17 '15 at 12:48
  • @Paya I've changed both methods... Now they have a flag to control endianness. I've even fixed the `0` padding (it wasn't working correctly). I've changed from `StringBuilder` to `char[]`. In the end the speed is the same, but I think they are more "correct". As I've said, before they were working BigEndian->BigEndian (like Jon response) – xanatos Mar 17 '15 at 15:41
  • @xanatos Actually I think you were handling the padding correctly before. With the current implementation, if you `Decode(Encode(new byte[] { 0, 0, 0, 0, 0 }))`, you won't get the same array, but you will get an array with 6 zeroes now. – Paya Mar 17 '15 at 16:24
  • I even think your previous solution was the most compact radix36 encoding I have ever seen - because you had a chance to encode the leading '0' bytes 1:1 byte-wise. And it seems to preserve the byte array INTACT, meaning that regardless of the content, decode(encode()) would always give you the exact same array. – Paya Mar 17 '15 at 16:32
  • @Paya I've decided to go to a fixed-length encoding... If you need to decode to the original, there is an optional `len` parameter exactly for this. `Decode(Encode(new byte[] { 0, 0, 0, 0, 0 }, chars),chars,true,5)` should give the original result. The old encoding was a variable length with the length that varied between the length of the byte array and the `Math.Ceiling(...)`, so for a byte[5], the length was between 5 and 8. If you had printed it, it wouldn't have been the best looking encoding of the world :-) So for a byte[5] you could have: 00000, 0000z, 000010, e13wu1of. Not so nice. – xanatos Mar 18 '15 at 08:10
  • For another implementation of Base-N en/decoding (albeit, I also used BigInt): http://stackoverflow.com/questions/14110010/base-n-encoding-of-a-byte-array – kornman00 Aug 14 '16 at 02:57
0
System.Text.Encoding enc = System.Text.Encoding.ASCII;
string myString = enc.GetString(myByteArray);

You can play with what encoding you need:

System.Text.ASCIIEncoding,
System.Text.UnicodeEncoding,
System.Text.UTF7Encoding,
System.Text.UTF8Encoding

To match the requrements [a-z][0-9] you can use it:

Byte[] bytes = new Byte[] { 200, 180, 34 };
string result = String.Join("a", bytes.Select(x => x.ToString()).ToArray());

You will have string representation of bytes with char separator. To convert back you will need to split, and convert the string[] to byte[] using the same approach with .Select().

Samich
  • 29,157
  • 6
  • 68
  • 77
  • How does this guarantee that all the characters in the string will be in the range [a-z][0-9]? – Kees C. Bakker Sep 13 '11 at 07:57
  • Regarding the answer: All of these encodings are valid for the required character range. No need to play. – Stephan Sep 13 '11 at 08:00
  • It's not guarantee that all of this will be in this range because each byte can represents not only chars and digits. You can save each byte like integer with separator using any characters if you need to match these requrements. It will looks like: `250a244a...`. It's just an option how you can match [a-z][0-9]. – Samich Sep 13 '11 at 08:00
0

Usually a power of 2 is used - that way one character maps to a fixed number of bits. An alphabet of 32 bits for instance would map to 5 bits. The only challenge in that case is how to deserialize variable-length strings.

For 36 bits you could treat the data as a large number, and then:

  • divide by 36
  • add the remainder as character to your result
  • repeat until the division results in 0

Easier said than done perhaps.

C.Evenhuis
  • 25,996
  • 2
  • 58
  • 72
0

you can use modulu. this example encode your byte array to string of [0-9][a-z]. change it if you want.

    public string byteToString(byte[] byteArr)
    {
        int i;
        char[] charArr = new char[byteArr.Length];
        for (i = 0; i < byteArr.Length; i++)
        {
            int byt = byteArr[i] % 36; // 36=num of availible charachters
            if (byt < 10)
            {
                charArr[i] = (char)(byt + 48); //if % result is a digit
            }
            else
            {
                charArr[i] = (char)(byt + 87); //if % result is a letter
            }
        }
        return new String(charArr);
    }

If you don't want to lose data for de-encoding you can use this example:

    public string byteToString(byte[] byteArr)
    {
        int i;
        char[] charArr = new char[byteArr.Length*2];
        for (i = 0; i < byteArr.Length; i++)
        {
            charArr[2 * i] = (char)((int)byteArr[i] / 36+48);
            int byt = byteArr[i] % 36; // 36=num of availible charachters
            if (byt < 10)
            {
                charArr[2*i+1] = (char)(byt + 48); //if % result is a digit
            }
            else
            {
                charArr[2*i+1] = (char)(byt + 87); //if % result is a letter
            }
        }
        return new String(charArr);
    }

and now you have a string double-lengthed when odd char is the multiply of 36 and even char is the residu. for example: 200=36*5+20 => "5k".

Seffix
  • 1,049
  • 1
  • 12
  • 16