18

I'm looking for a function that will generate an "alphanumeric hash". Given a source string, it produces a determinate result string that can contain any letter a-z or digit 0-9, and cannot be reverse-engineered to produce the source. This will be used to generate passwords for a system based on secret data, so strings between 8 and 12 characters are ideal and a secure hash would also be ideal.

I'm thinking I can use a normal bitwise hash, XOR-fold it to 64 bits (if I use, for instance, SHA256) and then take the result 5 bits at a time (producing a number 0-31) and look up the character code to use from an indexed ordered collection. There are 26 letters and 10 digits meaning I'll have to leave a few out (probably removing characters that could be mistaken for others if handwritten). 64 bits, 5 bits at a time, will produce a 12-character string with 4 bits left over.

However, I'm worried about two things: first, introducing bias by taking a non-power-of-2 number of bits; and second, what to do with the leftover bits. Do I use them as-is knowing there will only be 16 possibilities, do I leave them off (and lose data possibly introducing bias), or do I incorporate one more bit to make a 13-character string (and where should the last bit come from)?

EDIT: Here's my current stab at it; it takes an enumerable of bytes (like the byte array produced by most hash algorithms) and returns a string:

    /// <summary>
    /// Converts an IEnumerable of bytes to a string representation which can have any lowercase letter a-z except for l, o, q and z, and any digit 0-9.
    /// Uses 5 bits of the byte array at a time to generate numbers from 0 to 31, which are then translated to letters or numbers.
    /// </summary>
    /// <param name="toConvert">the byte array to convert.</param>
    /// <returns>A string containing the alphanumeric case-insensitive representation of the bytes in the array.</returns>
    public static string ToInsensitiveAlphaNumericString(this IEnumerable<byte> toConvert)
    {
        var chars = new[]
                        {
                            'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n', 'p', 'r', 's', 't',
                            'u', 'v', 'w', 'x', 'y', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
                        };

        var enumerator = toConvert.GetEnumerator();
        enumerator.MoveNext();

        int buffer = enumerator.Current;
        short bufferLength = 8;
        const int valueLength = 5;

        var builder = new StringBuilder();

        while (true)
        {
            var value = buffer >> (bufferLength - valueLength);

            builder.Append(chars[value]);

            buffer = buffer - (value << (bufferLength - valueLength));
            bufferLength -= valueLength;

            if(bufferLength < valueLength )
            {
                if (enumerator.MoveNext())
                {
                    buffer = (buffer << 8) + enumerator.Current;
                    bufferLength += 8;
                }
                else
                {
                    //here's the main question; to include, or not to include?
                    if (bufferLength > 0)
                        builder.Append(chars[buffer]);
                    break;
                }
            }
        }

        return builder.ToString();
    }
KeithS
  • 70,210
  • 21
  • 112
  • 164

2 Answers2

16

How about generating your SHA256 and then Base36 encoding the result? No left over bits, no bias...

That way you have the cryptographic strength of a proven algorithm (remember to salt and use multiple hash iterations) along with the alphanumeric representation that you need.

Eric J.
  • 147,927
  • 63
  • 340
  • 553
  • If your looking to just disguise what system you used for hashing and don't want the result to be easily reversible to the output of the hash function, just do this and apply a Caesar shift to it - simple, old, but can still give things an interesting twist :) – John Mitchell Jul 27 '12 at 01:10
  • This solution will work fine, but it's worth noting that there is still a biased character. If the base-36 representation of 2^{256} has 6 as its first digit (I think it does, but I have only checked sloppily), then every encoded value will have a value between 0 and 6 in that position. – David Jul 27 '12 at 01:33
  • 1
    @David: You will have a bias *of the character positions* but that does not matter. Base36 is just a *human-convenient representation* of a perfectly unbiased 256-bit number. – Eric J. Jul 27 '12 at 01:51
  • Yep, I totally agree. I just mentioned it because the OP mentioned concern about biased characters (when he was already talking about a human-convenient representation of a perfectly unbiased 64-bit number). – David Jul 27 '12 at 01:59
  • This is a good idea, and trivial on a 64-bit number. Slightly more complex on a number stored in multiple bytes, but still totally possible. – KeithS Jul 27 '12 at 14:25
  • Only one possible issue with this; a hash value can be any integer value that will fit in the hash's generated bits. A UInt64, for instance, can be zero, or it can be 18,446,744,073,709,551,615. I'm only guaranteed to get a 12-char password if the value is greater than 36^12 (about a 75% chance), and there is a not-insignificant chance of 10 or fewer characters. While infinitesimal, there is the possibility that I could end up with a single-character password. – KeithS Jul 27 '12 at 20:17
  • If you pad with 0's, you'll still end up with suitably long passwords. The character sequence "0000abcd" is just as probable as "0001abcd" so an attacker would not gain any benefit by knowing that certain values are zero-padded. Those values still have to be tried, and are just as probable as any other sequence. Short passwords aren't bad *per se*, the are bad because humans tend to pick very short passwords creating a strong statistical bias toward short passwords. "a" would be just as good a password as "dut6dja1jdi9" if not for that bias. – Eric J. Jul 27 '12 at 20:38
  • For future readers. If you can be a little more "flexible".. to all alpha-numerics (including lower-case, which the original question does not desire)....then you can consider Base62, and this SOF answer has some dotnet libraries that exist : https://stackoverflow.com/questions/17745025/convert-a-string-into-base62/17745194#17745194 but those dotnet libraries may help you find java/python/other equivalents – granadaCoder Nov 18 '21 at 15:38
  • And during this process of discovery I learned about base58. See : https://en.wikipedia.org/wiki/Base62 " The 0OIl characters are not used in the base58 encoding scheme." – granadaCoder Nov 18 '21 at 15:41
0

If you just use those bits as they are (so that one character only has 16 possibilities), you still have your full 64 bits of entropy. If you're happy with 64 bits of entropy (which it sounds like you are), there's no reason to mind that one character has a restricted range.

If you have some reason (aesthetics?) to prefer that all of the characters have the full range, then you can drop those 4 bits, but you'll be taking yourself down to 60 bits of entropy. If you would have been happy with 8-character passwords, then it sounds like 60 bits is also plenty.

So whichever of those is easier should work fine.

David
  • 1,429
  • 8
  • 8