Algorithm to turn numeric IDs in to short, different alphanumeric codes

Question

I have IDs from a database, and I want them to be short and easily differentiatable by eye (i.e., two close numbers look different).

Like this:

13892359163211 -> ALO2WE7
13992351216421 -> 52NBEK3

or similar, algorithmically. So kind of like a hash, except it needs to be reversible? An encryption algorithm like AES is almost ideal, except that its outputs are way too long. (and overkill).

I'm using Python (3), although I don't think that should really matter

Why downvote? "This question does not show any research effort; it is unclear or not useful". Perfectly clear and useful, and I have already researched (see: AES and hashing) — retnikt, Aug 23 '19 at 10:34
Making them significant shorter requires a larger alphabet on the right hand side than just A-Z0-9. — President James K. Polk, Aug 23 '19 at 12:15
Sidenote, not really an answer to my question, but an alternative method: you could simply generate random strings and attach them to the rows/documents in the database. — retnikt, Aug 23 '19 at 19:01
That would be fine, but you still need to use a random technique that will not produce duplicates. — President James K. Polk, Aug 23 '19 at 19:09

nitzel · Answer 1 · 2019-08-23T13:49:09.713

New answer with 'close' numbers looking different

You could use RSA to encrypt (and later decrypt) your numbers. This is definitely overkill - but ... here is the example: Install https://github.com/sybrenstuvel/python-rsa (pip install rsa)

import rsa
import rsa.core
# (pubkey, privkey) = rsa.newkeys(64) # Generate key pair
pubkey = rsa.PublicKey(n=9645943279888986023, e=65537)
privkey = rsa.PrivateKey(n=9645943279888986023, e=65537, d=7507666207464026273, p=9255782423, q=1042153201)

print("1st", rsa.core.encrypt_int(13892359163211, pubkey.e, pubkey.n))
print("2nd", rsa.core.encrypt_int(13992351216421, pubkey.e, pubkey.n))
print("1st", hex(rsa.core.encrypt_int(13892359163211, pubkey.e, pubkey.n))[2:])
print("2nd", hex(rsa.core.encrypt_int(13992351216421, pubkey.e, pubkey.n))[2:])

# If you want to compare a couple of numbers that are similar
for i in range (13892359163211, 13892359163251):
  encrypted = rsa.core.encrypt_int(i, pubkey.e, pubkey.n)
  # decrypted = rsa.core.decrypt_int(encrypted, privkey.d, privkey.n)
  print (i, hex(encrypted)[2:], encrypted)

Please not that you cannot encrypt numbers bigger than pubkey.n. This is a RSA related limitation. By generating a different keypair with a higher n you can circumvent this issue. If you would like all generated numbers to have the same length, prefix them with leading zeroes. You could also consider making them uppercase for better readability. To make the displayed strings shorter consider using the base62 encoding mentioned in my old answer below.

output

1st 5427392181794576250
2nd 7543432434424555966
1st 4b51f86f0c99177a
2nd 68afa7d5110929be

input          hex(encrypted)   encrypted
13892359163211 4b51f86f0c99177a 5427392181794576250
13892359163212 2039f9a3f5cf5d46 2322161565485194566
13892359163213 173997b57918a6c3 1673535542221383363
13892359163214 36644663653bbb4  244958435527080884
13892359163215 c2eeec0c054e633  877901489011746355
...

Old answer related to displaying the numbers a bit shorter, not being aware that they should look substantially different

You want to change the base of your number from 10 to something bigger to use less characters. See https://stackoverflow.com/a/1119769 for an example with base 62 (a-zA-Z0-9).

Or quick and dirty for base 16, (0-9A-F, hexadecimal).

hex(13892359163211)[2:] # -> 'ca291220d4b'

The problem with this is that two similar decimal numbers will also look similar in hex — retnikt, Aug 23 '19 at 10:27
Ah I overlooked that in the question. Updated with using RSA. Please consider updating your question to make this requirement stand out a bit more. `short and easily differentiatable by eye` didn't quite convey the message of looking completely different for me. — nitzel, Aug 23 '19 at 13:41

President James K. Polk · Answer 2 · 2019-08-24T02:57:44.100

The problem is easier to state than it is to solve. One solution is to borrow some ideas from format-preserving encryption, but simplifying because security is not a goal. Using the Feistel cipher framework a very short and reversible "mixing" function can be written, followed by a short encoding function, to achieve something that appears to be what you want.

import hashlib
import string

mask = (1 << 22) - 1
alphabet = string.ascii_uppercase + string.digits


def func(x: int):
    return int.from_bytes(hashlib.sha256(x.to_bytes(3, 'big')).digest(), 'big') & mask


def mix(id_in: int):
    L, R = id_in >> 22, id_in & mask
    L ^= func(R)
    R ^= func(L)
    return (L << 22) | R


def unmix(mixed: int):
    L, R = mixed >> 22, mixed & mask
    R ^= func(L)
    L ^= func(R)
    return (L << 22) | R


def base_n_encode(value: int):
    digits = []
    for i in range(9):
        value, rem = divmod(value, len(alphabet))
        digits.insert(0, rem)
    return ''.join(alphabet[digit] for digit in digits)


def base_n_decode(encoded: str):
    digits = [alphabet.index(ch) for ch in encoded]
    result = 0
    for digit in digits:
        result = result * len(alphabet) + digit
    return result


def encode(id_in: int):
    return base_n_encode(mix(id_in))


def decode(encoded: str):
    return unmix(base_n_decode(encoded))


if __name__ == '__main__':
    e1 = encode(13892359163211)
    e2 = encode(13992351216421)
    print('13892359163211 -> ' + e1)
    print('13992351216421 -> ' + e2)
    print(e1 + ' -> ' + str(decode(e1)))
    print(e2 + ' -> ' + str(decode(e2)))

Output is:

13892359163211 -> BC33VXN8A
13992351216421 -> D1UOW6SLL
BC33VXN8A -> 13892359163211
D1UOW6SLL -> 13992351216421

Note the use of sha256. This is slow and most definitely overkill, but it has the advantage of being built-in to python and thus a one-liner. Unless you are converting millions of IDs speed shouldn't be an issue, but if it is you can replace func with something much, much faster, maybe Murmur3.

The code is written with hard-coded constants to make it a little easier to see what's going on, but it can be generalized to work with arbitrary length (in bits) IDs and arbitrary alphabets.

A more general version of this example is available on github.

score 1 · Answer 3 · answered Aug 23 '19 at 10:21

1

How about finding crc32 for the input and showing the result in hex?

>>> n = 13892359163211
>>> 
>>> import binascii
>>> hex(binascii.crc32(str(n).encode()))[2:]
'240a831a'

answered Aug 23 '19 at 10:21

Prem Anand

2,469
16
16

a CRC32 is only reversible for values <4Bytes (4294967296 as int or 9999 as string (your approach)) which is way too small for a long database ID. If there were a theoretical CRC128, its outputs would be too long – retnikt Aug 23 '19 at 10:30

score 0 · Answer 4 · answered Aug 23 '19 at 10:43

0

Convert the numeric ID's to binary form (3) and use an encoder (4, 5).

In [1]: import struct, base64

In [2]: i = 13892359163211
Out[2]: 13892359163211

In [3]: struct.pack('L', i)
Out[3]: b'K\r"\x91\xa2\x0c\x00\x00'

In [4]: base64.b85encode(struct.pack('L', i)).decode('ascii')
Out[4]: 'OAR8Cq6`24'

In [5]: base64.b64encode(struct.pack('L', i)).decode('ascii')[:-1]
Out[5]: 'Sw0ikaIMAAA'

Which encoder to use depends on which characters you want to allow.

answered Aug 23 '19 at 10:43

Roland Smith

42,427
3
64
94

this also doesn't solve the problem!!! the length needs to be fixed, and different IDs need to be substantially different! – retnikt Aug 23 '19 at 10:47
The length *is* fixed because of the conversion to 8-byte binary. The only algorithms that comes to mind that produce "substantially different" results for numbers that are close together are hash functions... – Roland Smith Aug 23 '19 at 10:56

score 0 · Answer 5 · answered Aug 23 '19 at 23:45

0

You can use CrypII idea to convert from integer to base64. This will be the shortest

13892359163211 is 4LWL and
13992351216421 is 64yl

answered Aug 23 '19 at 23:45

kelalaka

5,064
5
27
44

Algorithm to turn numeric IDs in to short, different alphanumeric codes

5 Answers5

Linked