1

I have an application that is kind of like a URL shortener and need to generate unique URL whenever a user requests.

For this I need a function to map an index/number to a unique string of length n with two requirements:

  1. Two different numbers can not generate same string. In other words as long as i,j<K: f(i) != f(j). K is the number of possible strings = 26^n. (26 is number of characters in English)
  2. Two strings generated by number i and i+1 don't look similar most of the times. For example they are not abcdef1 and abcdef2. (So that users can not predict the pattern and the next IDs)

This is my current code in Python:

chars = "abcdefghijklmnopqrstuvwxyz"

for item in itertools.product(chars, repeat=n):
    print("".join(item))

# For n = 7 generates:
# aaaaaaa
# aaaaaab
# aaaaaac
# ...

The problem with this code is there is no index that I can use to generate unique strings on demand by tracking that index. For example generate 1 million unique strings today and 2 million tomorrow without looping through or collision with the first 1 million.

The other problem with this code is that the strings that are created after each other look very similar and I need them to look random.

One option is to populate a table/dictionary with millions of strings, shuffle them and keep track of index to that table but it takes a lot of memory.

An option is also to check the database of existing IDs after generating a random string to make sure it doesn't exist but the problem is as I get closer to the K (26^n) the chance of collision increases and it wouldn't be efficient to make a lot of check_if_exist queries against the database.

Also if n was long enough I could use UUID with small chance of collision but in my case n is 7.

iman
  • 199
  • 1
  • 14
  • 1
    So basically, you need a unique string for each number that is of length `n`? If I had two of the same numbers, would I have different strings? – Ayush Garg Jan 30 '21 at 22:51
  • Right. same number always generates same string. This is to ensure the new numbers (that are incremented per request) won't generate strings that have collision with the previously generated strings. – iman Jan 30 '21 at 22:59
  • So are you basically asking for if the number was `0`, then you would get `aaaaaaa`, if the number was `1`, you would get `aaaaaab`, etc.? (for `n` = 7) If this is the case, I think I have an idea that will work – Ayush Garg Jan 30 '21 at 23:07
  • 1
    Correct, except that I want the aaaaaab to look more different from aaaaaaa so the users can not predict the pattern that what the next generated string is going to be. (requirement # 2) – iman Jan 30 '21 at 23:10
  • 3
    You're looking for a unique hash-style ID. The usual coding direction is to turn a string into an integer; you're going in the opposite direction. However, the principles are the same: take some input, manipulate it in a manner that is not easy to invert, and produce an output. You can code any string as an integer, and any integer as a string (base 26 <=> decimal conversion), so those algorithms would solve your problem. Please continue your research. – Prune Jan 30 '21 at 23:13
  • 1
    @Prune worded it perfectly. In order to "manipulate it in a manner that is not easy to invert", you could use some obscure math equations which take a number as an input and output a character, and tile that to use `n` times (with a diff. math equation each time or a modification of the original one). This would make it seem random, but keep the same string for same numbers. – Ayush Garg Jan 30 '21 at 23:15
  • @iman You basically need two ingredients: 1. a (bijective) chaotic map that works on discrete domains (see e.g. [this paper](https://ieeexplore.ieee.org/document/9115620) or [this one](https://link.springer.com/article/10.1007/s11071-020-05503-y) if you can tolerate collisions) and 2. a way to compute the n-th permutation of a string 'on the fly' (i.e. without computing the preceding ones); the latter is fairly straightforward, see e.g. [this question](https://math.stackexchange.com/q/60742/165080). The difficulty of this task really comes down to the question if you can tolerate collisions. – a_guest Jan 30 '21 at 23:22

4 Answers4

2

I'm going to outline a solution for you that is going to resist casual inspection even by a knowledgeable person, though it probably IS NOT cryptographically secure.

First, your strings and numbers are in a one-to-one map. Here is some simple code for that.

alphabet = 'abcdefghijklmnopqrstuvwxyz'
len_of_codes = 7
char_to_pos = {}
for i in range(len(alphabet)):
    char_to_pos[alphabet[i]] = i

def number_to_string(n):
    chars = []
    for _ in range(len_of_codes):
        chars.append(alphabet[n % len(alphabet)])
        n = n // len(alphabet)
    return "".join(reversed(chars))

def string_to_number(s):
    n = 0
    for c in s:
        n = len(alphabet) * n + char_to_pos[c]
    return n

So now your problem is how to take an ascending stream of numbers and get an apparently random stream of numbers out of it instead. (Because you know how to turn those into strings.) Well, there are lots of tricks for primes, so let's find a decent sized prime that fits in the range that you want.

def is_prime (n):
    for i in range(2, n):
        if 0 == n%i:
            return False
        elif n < i*i:
            return True
    if n == 2:
        return True
    else:
        return False

def last_prime_before (n):
    for m in range(n-1, 1, -1):
        if is_prime(m):
            return m

print(last_prime_before(len(alphabet)**len_of_codes)

With this we find that we can use the prime 8031810103. That's how many numbers we'll be able to handle.

Now there is an easy way to scramble them. Which is to use the fact that multiplication modulo a prime scrambles the numbers in the range 1..(p-1).

def scramble1 (p, k, n):
    return (n*k) % p

Picking a random number to scramble by, int(random.random() * 26**7) happened to give me 3661807866, we get a sequence we can calculate with:

for i in range(1, 5):
    print(number_to_string(scramble1(8031810103, 3661807866, i)))

Which gives us

lwfdjoc
xskgtce
jopkctb
vkunmhd

This looks random to casual inspection. But will be reversible for any knowledgeable someone who puts modest effort in. They just have to guess the prime and algorithm that we used, look at 2 consecutive values to get the hidden parameter, then look at a couple of more to verify it.

Before addressing that, let's figure out how to take a string and get the number back. Thanks to Fermat's little theorem we know for p prime and 1 <= k < p that (k * k^(p-2)) % p == 1.

def n_pow_m_mod_k (n, m, k):
    answer = 1
    while 0 < m:
        if 1 == m % 2:
            answer = (answer * n) % k
        m = m // 2
        n = (n * n) % k
    return answer

print(n_pow_m_mod_k(3661807866, 8031810103-2, 8031810103))

This gives us 3319920713. Armed with that we can calculate scramble1(8031810103, 3319920713, string_to_number("vkunmhd")) to find out that vkunmhd came from 4.

Now let's make it harder. Let's generate several keys to be scrambling with:

import random
p = 26**7
for i in range(5):
    p = last_prime_before(p)
    print((p, int(random.random() * p)))

When I ran this I happened to get:

(8031810103, 3661807866)
(8031810097, 3163265427)
(8031810091, 7069619503)
(8031809963, 6528177934)
(8031809917, 991731572)

Now let's scramble through several layers, working from smallest prime to largest (this requires reversing the sequence):

def encode (n):
    for p, k in [
            (8031809917, 991731572)
          , (8031809963, 6528177934)
          , (8031810091, 7069619503)
          , (8031810097, 3163265427)
          , (8031810103, 3661807866)
        ]:
        n = scramble1(p, k, n)
    return number_to_string(n)

This will give a sequence:

ehidzxf
shsifyl
gicmmcm
ofaroeg

And to reverse it just use the same trick that reversed the first scramble (reversing the primes so I am unscrambling in the order that I started with):

def decode (s):
    n = string_to_number(s)
    for p, k in [
            (8031810103, 3319920713)
          , (8031810097, 4707272543)
          , (8031810091, 5077139687)
          , (8031809963, 192273749)
          , (8031809917, 5986071506)
        ]:
        n = scramble1(p, k, n)

    return n

TO BE CLEAR I do NOT promise that this is cryptographically secure. I'm not a cryptographer, and I'm aware enough of my limitations that I know not to trust it.

But I do promise that you'll have a sequence of over 8 billion strings that you are able to encode/decode with no obvious patterns.

Now take this code, scramble the alphabet, regenerate the magic numbers that I used, and choose a different number of layers to go through. I promise you that I personally have absolutely no idea how someone would even approach the problem of figuring out the algorithm. (But then again I'm not a cryptographer. Maybe they have some techniques to try. I sure don't.)

btilly
  • 43,296
  • 3
  • 59
  • 88
1

How about :

from random import Random

n = 7


def f(i):
    myrandom = Random()
    myrandom.seed(i)
    alphabet = "123456789"
    return "".join([myrandom.choice(alphabet) for _ in range(7)])


# same entry, same output
assert f(0) == "7715987"
assert f(0) == "7715987"
assert f(0) == "7715987"

# different entry, different output
assert f(1) == "3252888"

(change the alphabet to match your need)

This "emulate" a UUID, since you said you could accept a small chance of collision. If you want to avoid collision, what you really need is a perfect hash function (https://en.wikipedia.org/wiki/Perfect_hash_function).

Ayush Garg
  • 2,234
  • 2
  • 12
  • 28
Ribodou
  • 61
  • 1
  • 6
0

you can try something based on the sha1 hash

#!/usr/bin/python3

import hashlib

def generate_link(i):
    n = 7
    a = "abcdefghijklmnopqrstuvwxyz01234567890"
    return "".join(a[x%36] for x in hashlib.sha1(str(i).encode('ascii')).digest()[-n:])
SR3142
  • 550
  • 3
  • 9
  • Note that if you want the original number to not be able to be reverse engineered, `sha1` is not secure enough nowadays. See [this](https://stackoverflow.com/questions/16713810/how-secure-is-md5-and-sha1) stackoverflow question for more information. – Ayush Garg Jan 30 '21 at 23:23
  • If you do something based on hashing, you're going to face a birthday problem. Which is that somewhere around the sqrt of your possible search space you'll start getting high odds of collision. Which with this search space is around 90k. And the expected number of collisions rises quadratically after that. – btilly Jan 31 '21 at 00:33
  • @btilly i agree that you could end up with duplicates but this is no different to using the random module. i ran a test to generate the first 1m links and with sha1 and n = 7 and there were 7 duplicates, with n = 8 there were 0 duplicates, it runs in 2s. using random also with n = 7 i got 7 duplicates again but it took 90s to run and with n = 8 there was 1 duplicate. – SR3142 Feb 02 '21 at 19:23
  • @btilly just read your answer, that is a great algorithm, very fast and produces no duplicates. – SR3142 Feb 02 '21 at 19:47
0

This is a really simple example of what I outlined in this comment. It just offsets the number based on i. If you want "different" strings, don't use this, because if num is 0, then you will get abcdefg (with n = 7).

alphabet = "abcdefghijklmnopqrstuvwxyz"

# num is the num to convert, i is the "offset"
def num_to_char(num, i):
    return alphabet[(num + i) % 26]

# generate the link
def generate_link(num, n):
    return "".join([num_to_char(num, i) for i in range(n)])

generate_link(0, 7) # "abcdefg"
generate_link(0, 7) # still "abcdefg"
generate_link(0, 7) # again, "abcdefg"!

generate_link(1, 7) # now it's "bcdefgh"!

You would just need to change the num + i to some complicated and obscure math equation.

Ayush Garg
  • 2,234
  • 2
  • 12
  • 28