1

I have this method which returns a string of cryptographically strong random characters.

First, GetBytes() populates a byte array with values from 0 to 255.

Next, the return string is built by picking character number {byte value} % {length of character set} from the character set.

The problem is that the length of most of my character sets are not evenly divisible by 256, so the result will be biased towards some characters. If for instance the character set is 8, 16 or 32 characters long, the remainder is 0 and there is no problem.

So I was thinking - can I restrict the values returned by GetBytes() in such a way that the length of the character set will be evenly divisible by the max value? For instance if the character set is of length 62, the max value should be 247.

I could of course just get one byte at a time, and if the value was too high, I could get another one. But that's not very elegant.

/// <summary>
/// Returns a string of cryptographically sound random characters
/// </summary>
/// <param name="type">Accepted parameter variables are HEX (0-F), hex (0-f),
/// DEC/dec/NUM/num (0-9), ALPHA (A-Z), alpha (a-z), ALPHANUM (A-Z and 0-9),
/// alphanum (a-z and 0-9) and FULL/full (A-Z, a-z and 0-9)</param>
/// <param name="length">The length of the output string</param>
/// <returns>String of cryptographically sound random characters</returns>
public static string Serial(string type, int length)
{
    if (length < 1) return "";
    string chars;
    switch (type)
    {
        case "HEX":
            chars = "0123456789ABCDEF"; // 16
            break;
        case "hex":
            chars = "0123456789abcdef"; // 16
            break;
        case "DEC":
        case "dec":
        case "NUM":
        case "num":
            chars = "0123456789"; // 10
            break;
        case "ALPHA":
            chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; // 26
            break;
        case "alpha":
            chars = "abcdefghijklmnopqrstuvwxyz"; // 26
            break;
        case "ALPHANUM":
            chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; // 36
            break;
        case "alphanum":
            chars = "abcdefghijklmnopqrstuvwxyz0123456789"; // 36
            break;
        case "FULL":
        case "full":
        default:
            chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; // 62
            break;
    }
    byte[] data = new byte[length];
    using (RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider())
    {
        crypto.GetBytes(data);
    }
    StringBuilder result = new StringBuilder(length);
    foreach (byte b in data)
    {
        result.Append(chars[b % chars.Length]);
    }
    return result.ToString();
}
Stian
  • 1,522
  • 2
  • 22
  • 52
  • 2
    "I could of course just get one byte at a time, and if the value was too high, I could get another one. But that's not very elegant." - you're free to think so but it's the standard way of addressing this sort of situation. (Also, 0-255, not 1-256) – Damien_The_Unbeliever Aug 24 '18 at 08:03
  • 2
    Also, why are you writing a stringly typed method instead of using an `Enum` to indicate what type of encoding you'll use? – Damien_The_Unbeliever Aug 24 '18 at 08:04

1 Answers1

2

As indicated in the comments, rejection sampling is the standard way of doing this. We can amortize some of the costs by moving our use of the RNG Crypto provider into a helper method so that we don't have to work byte-by-agonizing-byte with it:

    public static string Serial(string type, int length)
    {
        if (length < 1) return "";
        string chars;
        switch (type)
        {
            case "HEX":
                chars = "0123456789ABCDEF"; // 16
                break;
            case "hex":
                chars = "0123456789abcdef"; // 16
                break;
            case "DEC":
            case "dec":
            case "NUM":
            case "num":
                chars = "0123456789"; // 10
                break;
            case "ALPHA":
                chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; // 26
                break;
            case "alpha":
                chars = "abcdefghijklmnopqrstuvwxyz"; // 26
                break;
            case "ALPHANUM":
                chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; // 36
                break;
            case "alphanum":
                chars = "abcdefghijklmnopqrstuvwxyz0123456789"; // 36
                break;
            case "FULL":
            case "full":
            default:
                chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; // 62
                break;
        }
        int limit = (256 / chars.Length) * chars.Length;
        StringBuilder result = new StringBuilder(length);
        foreach (byte b in SecureBytesInRange(limit,length))
        {
            result.Append(chars[b % chars.Length]);
        }
        return result.ToString();
    }
    private const int SECURE_BYTE_BUFFER_SIZE = 32;
    static IEnumerable<byte> SecureBytesInRange(int exclusiveUpperBound, int countRequired)
    {
        using (RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider())
        {
            byte[] buffer = new byte[SECURE_BYTE_BUFFER_SIZE];
            int ix = SECURE_BYTE_BUFFER_SIZE;
            int countProduced = 0;
            while (countProduced < countRequired)
            {
                if (ix == SECURE_BYTE_BUFFER_SIZE)
                {
                    crypto.GetBytes(buffer);
                    ix = 0;
                }
                while (ix < SECURE_BYTE_BUFFER_SIZE)
                {
                    if (buffer[ix] < exclusiveUpperBound)
                    {
                        yield return buffer[ix];
                        countProduced++;
                        if (countProduced == countRequired) break;
                    }
                    ix++;
                }
            }
        }
    }

As I also indicated in the comments, I'd create an enum for the supported encoding types rather than using a string, or in the alternative, have named constants/properties that return the actual ranges of characters to be used, so instead you're directly passed chars rather than type (this also adds flexibility by allowing your function to be used with other ranges of characters than just the ones you can currently think of.

Damien_The_Unbeliever
  • 234,701
  • 27
  • 340
  • 448
  • Great idea to pass the chars instead of selecting from a limited set! I'm on mobile now. When I get back to my PC I will try out your solution and accept the answer. – Stian Aug 25 '18 at 12:25