Short Unique Hexadecimal String in Python

Question

I need to generate a unique hexadecimal string in Python 3 that meets the following requirements:

It should contain 6 characters.
It should not contain only digits; there must be at least one letter (a-f).
The generated strings should be random, not in any particular order.
The probability of conflict should be as low as possible.

I have considered uuid4(), but it generates strings with too many characters, and any substring of the generated string may consist entirely of digits (i.e., contain no letter).

Is there any other way to fulfill these conditions? Thanks in advance!

EDIT

Can we use a hash, for example SHA-1, to fulfill the above requirements?

uuid4() only generates digits 0-9 and characters a-f (i.e., hexadecimal digits). Do you only want hexadecimal digits, or are you open to having any character? If all characters are allowed, can they be both uppercase and lowercase? — Matthias Fripp, Aug 25 '16 at 07:27
Warning: 6 hex digits and requiring at least one letter a-f gives you 15,777,216 possible strings. Due to the birthday paradox, selecting these randomly you should expect to see individual repeats in collections on the size order of 4000 strings. Depending on what you are doing with these codes, you will want to allow for that. — Neil Slater, Aug 25 '16 at 07:32
The letter-only byte can go in 6 different positions, but in some cases different positions will produce identical strings. I think there are `16**6` (possible hex strings) - `10**6` (possible decimal-only strings) = `15 777 216` permutations. — Matthias Fripp, Aug 25 '16 at 09:12
@MoinuddinQuadri: I am sorry for the delay. But I couldn't test any of the answers yet as I am stuck right now in some other things. Sorry for that to all the people who answered and thanks to you guys!! — Jay Patel, Aug 26 '16 at 17:07
@mfripp: I need to generate just hexadecimal strings and a-f in lower cases! — Jay Patel, Aug 27 '16 at 19:08
@JayPatel, Sorry, I must have missed the title and first sentence for some reason! — Matthias Fripp, Aug 27 '16 at 23:30
You said "These generated strings should be random. They should not be in any order." How strong is that requirement? Can they be generated by hashing a counter, where they look random but someone might be able to deduce the formula if they observe enough samples (since they know the samples are generated in order) or if they read your source code? — Matthias Fripp, Aug 28 '16 at 01:58
Actually, even if they are generated in order, it will be no benefit. Because it may be called randomly. For example, a set of 20 different blocks and these generated strings will be used to refer each blocks and the blocks can be in any order! — Jay Patel, Aug 28 '16 at 04:09

score 4 · Accepted Answer · edited May 23 '17 at 12:32

Here's a simple method that samples evenly from all allowed strings. Sampling uniformly makes conflicts as rare as possible, short of keeping a log of previous keys or using a hash based on a counter (see below).

import random
digits = '0123456789'
letters = 'abcdef'
all_chars = digits + letters
length = 6

while True:

   val = ''.join(random.choice(all_chars) for i in range(length))

   # The following line might be faster if you only want hex digits.
   # It draws an int with 24 random bits and formats it as lowercase
   # hex (no '0x' prefix), padded with zeros up to six places if needed
   # val = format(random.getrandbits(4*length), 'x').zfill(length)

   # test whether it contains at least one letter
   if not val.isdigit():
       break

# now val is a suitable string
print(val)
# 5d1d81
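
For reuse, the loop above can be wrapped in a function. This is only a sketch for Python 3; the names `random_hex_id` and `HEX_CHARS` are made up for illustration and do not come from the answer. It also computes how often a candidate is rejected, which is the chance that all six characters are digits.

```python
import random

HEX_CHARS = "0123456789abcdef"  # hypothetical constant name


def random_hex_id(length=6):
    """Return a random lowercase hex string with at least one letter a-f."""
    while True:
        val = "".join(random.choice(HEX_CHARS) for _ in range(length))
        if not val.isdigit():  # reject candidates made only of digits
            return val


# chance that a single candidate is rejected (all six characters are digits)
reject_probability = (10 / 16) ** 6  # about 0.06

print(random_hex_id())
```

Since only about 6% of candidates are rejected, the loop almost always finishes on the first or second pass.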

Alternatively, here's a somewhat more complex approach that also samples uniformly, but doesn't use any open-ended loops:

import random, bisect
digits = '0123456789'
letters = 'abcdef'
all_chars = digits + letters
length = 6

# find how many valid strings there are with their first letter in position i
pos_weights = [10**i * 6 * 16**(length-1-i) for i in range(length)]
pos_c_weights = [sum(pos_weights[0:i+1]) for i in range(length)]

# choose a random slot among all the allowed strings (slots are numbered
# from 1, so that position i receives exactly pos_weights[i] slots)
r = random.randint(1, pos_c_weights[-1])

# find the position for the first letter in the string
first_letter = bisect.bisect_left(pos_c_weights, r)

# generate a random string matching this pattern
val = ''.join(
    [random.choice(digits) for i in range(first_letter)]
    + [random.choice(letters)]
    + [random.choice(all_chars) for i in range(first_letter + 1, length)]
)

# now val is a suitable string
print(val)
# 4a99f0

And finally, here's an even more complex method that uses the random number r to index directly into the entire range of allowed values, i.e., it converts any integer from 0 to 15,777,215 (there are 15,777,216 allowed strings) into a suitable hex string. This could be used to completely avoid conflicts (discussed more below).

import random, bisect
digits = '0123456789'
letters = 'abcdef'
all_chars = digits + letters
length = 6

# find how many valid strings there are with their first letter in position i
pos_weights = [10**i * 6 * 16**(length-1-i) for i in range(length)]
# pos_c_weights[i] = number of valid strings whose first letter comes before position i
pos_c_weights = [sum(pos_weights[0:i]) for i in range(length + 1)]

# choose a random slot among all the allowed strings
r = random.randrange(pos_c_weights[-1])

# find the position for the first letter in the string
first_letter = bisect.bisect_right(pos_c_weights, r) - 1

# choose the corresponding string from among all that fit this pattern
offset = r - pos_c_weights[first_letter]
val = ''
# convert the offset to a collection of indexes within the allowed strings 
# the space of allowed strings has dimensions
# 10 x 10 x ... (for digits) x 6 (for first letter) x 16 x 16 x ... (for later chars)
# so we can index across it by dividing into appropriate-sized slices
for i in range(length):
    if i < first_letter:
        offset, v = divmod(offset, 10)
        val += digits[v]
    elif i == first_letter:
        offset, v = divmod(offset, 6)
        val += letters[v]
    else:
        offset, v = divmod(offset, 16)
        val += all_chars[v]

# now val is a suitable string
print(val)
# eb3493
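
The indexing idea can also be packaged as a function and checked exhaustively on a shorter length, where listing every string is cheap. This is a sketch rather than the answer's exact code: the name `index_to_hex` and the `starts` table are assumptions made for illustration.

```python
import bisect

DIGITS = "0123456789"
LETTERS = "abcdef"
ALL_CHARS = DIGITS + LETTERS


def index_to_hex(r, length=6):
    """Map an integer 0 <= r < 16**length - 10**length to a distinct valid string."""
    # number of valid strings whose first letter sits at position i
    weights = [10**i * 6 * 16**(length - 1 - i) for i in range(length)]
    # starts[i] = number of valid strings whose first letter is before position i
    starts = [sum(weights[:i]) for i in range(length + 1)]
    if not 0 <= r < starts[-1]:
        raise ValueError("index out of range")
    first_letter = bisect.bisect_right(starts, r) - 1
    offset = r - starts[first_letter]
    chars = []
    for i in range(length):
        if i < first_letter:
            offset, v = divmod(offset, 10)   # leading positions hold digits only
            chars.append(DIGITS[v])
        elif i == first_letter:
            offset, v = divmod(offset, 6)    # the first letter
            chars.append(LETTERS[v])
        else:
            offset, v = divmod(offset, 16)   # anything goes afterwards
            chars.append(ALL_CHARS[v])
    return "".join(chars)


# exhaustive check for 3-character strings: every index gives a different string
short = {index_to_hex(r, 3) for r in range(16**3 - 10**3)}
print(len(short) == 16**3 - 10**3)
```

Because each index maps to a different string, feeding this function non-repeating integers yields non-repeating strings.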

Uniform Sampling

I mentioned above that this samples uniformly across all allowed strings. Some other answers here choose 5 characters completely at random and then force a letter into the string at a random position. That approach produces more strings with multiple letters than you would get randomly. e.g., that method always produces a 6-letter string if letters are chosen for the first 5 slots; however, in this case the sixth selection should actually only have a 6/16 chance of being a letter. Those approaches can't be fixed by forcing a letter into the sixth slot only if the first 5 slots are digits. In that case, all 5-digit strings would automatically be converted to 5 digits plus 1 letter, giving too many 5-digit strings. With uniform sampling, there should be a 10/16 chance of completely rejecting the string if the first 5 characters are digits.
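
To put numbers on that skew, the sketch below compares the chance of getting an all-letter string under uniform sampling with the chance under a "5 random characters plus 1 forced letter" scheme. It assumes the forced letter is always drawn from a-f, so the result is all letters exactly when the 5 free characters are all letters.

```python
from fractions import Fraction

allowed = 16**6 - 10**6  # number of valid 6-character strings

# uniform sampling: every allowed string is equally likely
p_all_letters_uniform = Fraction(6**6, allowed)

# forced-letter scheme: all letters whenever the 5 free characters are letters
p_all_letters_forced = Fraction(6, 16) ** 5

print(float(p_all_letters_uniform))  # about 0.0030
print(float(p_all_letters_forced))   # about 0.0074
```

The forced-letter scheme produces all-letter strings roughly 2.5 times as often as uniform sampling would.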

Here are some examples that illustrate these sampling issues. Suppose you have a simpler problem: you want a string of two binary digits, with a rule that at least one of them must be a 1. Conflicts will be rarest if you produce 01, 10 or 11 with equal probability. You can do that by choosing random bits for each slot, and then throwing out the 00's (similar to my approach above).

But suppose you instead follow this rule: Make two random binary choices. The first choice will be used as-is in the string. The second choice will determine the location where an additional 1 will be inserted. This is similar to the approach used by the other answers here. Then you will have the following possible outcomes, where the first two columns represent the two binary choices:

0 0 -> 10
0 1 -> 01
1 0 -> 11
1 1 -> 11

This approach has a 0.5 chance of producing 11, or 0.25 for 01 or 10, so it will increase the risk of collisions among 11 results.

You could try to improve this as follows: Make three random binary choices. The first choice will be used as-is in the string. The second choice will be converted to a 1 if the first choice was a 0; otherwise it will be added to the string as-is. The third choice will determine the location where the second choice will be inserted. Then you have the following possible outcomes:

0 0 0 -> 10 (second choice converted to 1)
0 0 1 -> 01 (second choice converted to 1)
0 1 0 -> 10
0 1 1 -> 01
1 0 0 -> 01
1 0 1 -> 10
1 1 0 -> 11
1 1 1 -> 11

This gives 0.375 chance for 01 or 10, and 0.25 chance for 11. So this will slightly increase the risk of conflicts between duplicate 10 or 01 values.
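
The first two binary schemes are small enough to enumerate in full. The sketch below tallies the outcomes of the "insert an extra 1" rule and of plain rejection sampling; the function names are invented for illustration.

```python
from collections import Counter
from itertools import product


def insert_scheme(first, position):
    # keep `first` as-is and insert an extra 1 at `position` (0 = front, 1 = back)
    return "1" + str(first) if position == 0 else str(first) + "1"


def rejection_outcomes():
    # every 2-bit string except 00, each equally likely
    return ["".join(map(str, bits))
            for bits in product((0, 1), repeat=2) if any(bits)]


forced = Counter(insert_scheme(f, p) for f, p in product((0, 1), repeat=2))
uniform = Counter(rejection_outcomes())

print(sorted(forced.items()))   # 11 accounts for 2 of the 4 equally likely outcomes
print(sorted(uniform.items()))  # 01, 10 and 11 appear once each
```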

Reducing Conflicts

If you are open to using all letters instead of just 'a' through 'f' (hexadecimal digits), you could alter the definition of letters as noted in the comments. This will give much more diverse strings and much less chance of conflict. If you generated 1,000 strings allowing all upper- and lowercase letters, you'd only have about a 0.0009% chance of generating any duplicates, vs. 3% chance with hex strings only. (This will also virtually eliminate double-passes through the loop.)
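
Those percentages can be reproduced with the usual birthday-problem approximation. This is a rough sketch: `collision_probability` is a made-up helper name, and the formula assumes the keys are drawn uniformly.

```python
import math


def collision_probability(n_keys, n_possible):
    """Approximate chance of at least one duplicate among n_keys uniform draws."""
    return 1 - math.exp(-n_keys * (n_keys - 1) / (2 * n_possible))


hex_only = 16**6 - 10**6      # 15,777,216 strings
alphanumeric = 62**6 - 10**6  # digits plus upper- and lowercase letters

p_hex = collision_probability(1000, hex_only)
p_alnum = collision_probability(1000, alphanumeric)

print(p_hex)    # roughly 0.03
print(p_alnum)  # roughly 0.000009
```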

If you really want to avoid conflicts between strings, you could store all the values you've generated previously in a set and check against that before breaking from the loop. This would be good if you are going to generate fewer than about 5 million keys. Beyond that, you'd need quite a bit of RAM to hold the old keys, and it might take a few runs through the loop to find an unused key.
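
A minimal sketch of that bookkeeping, assuming the keys only need to be unique within one running process (the set is not persisted anywhere); `new_unique_id` is a hypothetical name.

```python
import random

HEX_CHARS = "0123456789abcdef"


def new_unique_id(used, length=6):
    """Draw candidates until one is valid and not already in `used`."""
    while True:
        val = "".join(random.choice(HEX_CHARS) for _ in range(length))
        if not val.isdigit() and val not in used:
            used.add(val)  # remember it so it is never handed out again
            return val


used_ids = set()
batch = [new_unique_id(used_ids) for _ in range(1000)]
print(len(set(batch)))
```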

If you need to generate more keys than that, you could encrypt a counter, as described at "Generating non-repeating random numbers in Python". The counter and its encrypted version would both be ints in the range of 0 to 15,777,215. The counter would just count up from 0, and the encrypted version would look like a random number. Then you would convert the encrypted version to a hex string using the third code example above. If you do this, you should generate a random encryption key at the start, and change the encryption key each time the counter rolls past your maximum, to avoid producing the same sequence again.
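
The linked technique is not reproduced here. As a stand-in, the sketch below scrambles a counter with a small 24-bit Feistel network plus "cycle walking", which keeps the output inside the allowed range. The round function (SHA-256 over key, round number and half-block) and all names are assumptions made for illustration; this is not a vetted cipher and should not be relied on for security.

```python
import hashlib
import secrets

N = 16**6 - 10**6   # number of allowed strings
HALF_BITS = 12      # a 24-bit block is split into two 12-bit halves
MASK = (1 << HALF_BITS) - 1


def _round(value, key, round_no):
    # keyed round function: 12 pseudo-random bits derived from SHA-256
    data = key + bytes([round_no]) + value.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big") & MASK


def permute24(x, key, rounds=4):
    # a Feistel network is a bijection on 24-bit values for any round function
    left, right = x >> HALF_BITS, x & MASK
    for r in range(rounds):
        left, right = right, left ^ _round(right, key, r)
    return (left << HALF_BITS) | right


def scramble(counter, key):
    # cycle walking: re-apply the permutation until the value is below N
    x = permute24(counter, key)
    while x >= N:
        x = permute24(x, key)
    return x


key = secrets.token_bytes(16)  # pick a fresh key whenever the counter wraps
print([scramble(i, key) for i in range(5)])
```

Each counter value below N maps to a different integer below N, so the outputs do not repeat until the counter wraps around.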

Very nice explanation! Upvoted! But for some reason I am restricted not to use loops and I need to do it either with uuid4() or hashes!! — Jay Patel, Aug 28 '16 at 01:09
Hmm, that makes it more interesting. But hashes of what? Hash of a counter variable? Hash of a random number? — Matthias Fripp, Aug 28 '16 at 01:52
@JayPatel I've added a hash function that can convert any number in the range of 0-15777216 into a unique, valid string (code block 3). I feed this a random number to get a random string. But it's probably simpler to create a suitable random string directly (code block 2). — Matthias Fripp, Aug 29 '16 at 05:19
You are right! These methods are too complex! haha..kidding! But nice though... By hash I was referring to `sha1` with `hashlib`! But thank you so much for a well explained answer! — Jay Patel, Aug 29 '16 at 15:43
You have 4 objectives for the string: 1. have 6 hex chars including at least 1 letter; 2. uniformly distributed (minimize conflicts); 3. generated by a canned function like uuid or sha1 (or maybe a random 1-liner); 4. don't use loops. Unfortunately, these are mutually incompatible; you will have to sacrifice one. In particular, sha1, uuid or a simple random choice will generate strings that could be all-digits. Then forcing in an additional random letter will skew the distribution toward letter-heavy strings, increasing the chance of conflicts. This is probably fine if you value simplicity. — Matthias Fripp, Aug 29 '16 at 18:09

Moinuddin Quadri · Answer 2 · 2016-08-25T08:19:43.103

1

Note: Updated the answer to produce a unique hexadecimal string. Earlier I assumed an alphanumeric string was wanted.

You may create your own function using the uuid and random libraries:

>>> import uuid
>>> import random
# Step 1: Slice uuid with 5 i.e. new_id = str(uuid.uuid4())[:5] 
# Step 2: Convert string to list of char i.e. new_id = list(new_id)
>>> uniqueval = list(str(uuid.uuid4())[:5])
# uniqueval = ['f', '4', '4', '4', '5']

# Step 3: Generate random number between 0-4 to insert new char i.e.
#         random.randint(0, 4)
# Step 4: Get random char between a-f (for Hexadecimal char) i.e.
#         chr(random.randint(ord('a'), ord('f')))
# Step 5: Insert random char to random index
>>> uniqueval.insert(random.randint(0, 4), chr(random.randint(ord('a'), ord('f'))))
# uniqueval = ['f', '4', '4', '4', 'f', '5']

# Step 6: Join the list
>>> uniqueval = ''.join(uniqueval)
# uniqueval = 'f444f5'

edited Aug 25 '16 at 08:19

answered Aug 25 '16 at 06:19

Moinuddin Quadri


I think this always includes at least one digit, but the question asked for at least one letter. – Matthias Fripp Aug 25 '16 at 07:10
Thanks. Yes, I miss understood that part. Updated the answer – Moinuddin Quadri Aug 25 '16 at 07:21
@MoinuddinQuadri: Just asking. Do you have any idea about the probability of repeating the same string? I mean how much higher the chances of uniqueness? Just want to get an idea! – Jay Patel Aug 27 '16 at 19:13
@MoinuddinQuadri: Did you do it 0-4 particularly with any purpose? What if we insert between 0-5? – Jay Patel Aug 27 '16 at 20:27

Martin Evans · Answer 3 · 2016-08-25T07:39:21.847

1

This approach works as follows: first pick one random letter to satisfy rule 2, then select 4 random entries from the list of all available characters, and shuffle the resulting list. Lastly, prepend one value taken from the list of all entries except 0, so that the string still has 6 characters if it is later converted to a number.

import random

all_chars = "0123456789abcdef"  # renamed from `all` to avoid shadowing the built-in
result = [random.choice('abcdef')] + [random.choice(all_chars) for _ in range(4)]
random.shuffle(result)
result.insert(0, random.choice(all_chars[1:]))
print(''.join(result))

Giving you something like:

3b7a4e

This approach avoids having to repeatedly check the result to ensure that it satisfies the rules.

edited Aug 25 '16 at 07:39

answered Aug 25 '16 at 07:32

Martin Evans


Why did you make the first digit non-zero? – Matthias Fripp Aug 25 '16 at 09:19
It really depends on how it is going to be used. If the string is converted to a number, then leading zeros would make it less than 6 characters. If this isn't an issue then that step can be removed. – Martin Evans Aug 25 '16 at 09:21
This approach always produces a valid string, but it doesn't sample the allowed region evenly. Restricting the first choice to letters skews the result toward multi-letter strings, increasing the chance of collisions. – Matthias Fripp Aug 25 '16 at 20:28
By imposing the rule that there must be one letter, the result is already skewed. This script does not impose the restriction that the first choice is a letter, that is what `shuffle()` is for. – Martin Evans Aug 26 '16 at 06:45
My point was that your method will produce strings with more letters than you would get by sampling uniformly across the _allowed_ set of strings. This is true regardless of how they're shuffled. For example, your method will produce 6-letter strings with a probability of `(6/16)**5 = 0.0074`. But sampling uniformly among the allowed strings would produce 6-letter strings with probability of `(6**6)/(16**6 - 10**6) = 0.0030`, where `6**6` is the number of possible six-letter strings and `16**6 - 10**6` is the number of allowed strings. – Matthias Fripp Aug 27 '16 at 04:10

score 0 · Answer 4 · answered Aug 25 '16 at 07:56

This function returns the nth string conforming to your requirements, so you can simply generate unique integers and convert them using this function.

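def inttohex(number, digits):
    # there must be at least one letter:
    fullhex = 16**(digits - 1)*6
    assert number < fullhex
    partialnumber, remainder = divmod(number, digits*6)
    # remainder < digits*6, so the quotient selects one of the 6 letters
    # and the remainder selects one of the `digits` insert positions
    charindex, charposition = divmod(remainder, digits)
    char = ['a', 'b', 'c', 'd', 'e', 'f'][charindex]
    hexconversion = list("{0:0{1}x}".format(partialnumber, digits-1))
    hexconversion.insert(charposition, char)
    # note: two different numbers can still give the same string, e.g. when
    # 'a' is inserted just before or just after an existing 'a'

    return ''.join(hexconversion)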

Now you can get a particular one using for instance

import random

digits = 6
inttohex(random.randrange(6*16**(digits-1)), digits)

You can't have maximum randomness along with minimum probability of conflict. I recommend keeping track of which numbers you have handed out or if you are looping through all of them somehow, using a randomly sorted list.
