Unique string with at least two differences

Question

I am looking for a way how to generate unique random (randomly-looking) alphanumeric string with a constraint that each such string has at least two different characters at non-adjacent positions. The length of the string should support at least 20 milion unique values (7 characters should be more than enough).

Example:

AAAAAAA <- first string
AAAAABB <- does not work (different, but adjacent)
ABAAAAA <- does not work (only one different)
AABAABA <- that works perfectly

I was first thinking about using some standard functions (I am using PostgreSql currently, but I can use Oracle as well). I can generate random string using something like md5(), but I have no idea how to satisfy the other constraint. One idea was to use levenshtein to check each new generated string against all already generated and accept it only when distance will be greater than X, but this seems like very brute force solution. Moreover, leventshtein checks only substitution, so the two characters that are different can still be adjacent.

My current solution:

--PostgreSQL 9.5    
with t as (
select 
  generate_series(1, 200) as id,
  substr('abcdefghijklmnopqrstuvwxyz0123456789', trunc(random() * 36)::integer + 1, 1) ||
  substr('abcdefghijklmnopqrstuvwxyz0123456789', trunc(random() * 36)::integer + 1, 1) ||
  substr('abcdefghijklmnopqrstuvwxyz0123456789', trunc(random() * 36)::integer + 1, 1) ||
  substr('abcdefghijklmnopqrstuvwxyz0123456789', trunc(random() * 36)::integer + 1, 1) ||
  substr('abcdefghijklmnopqrstuvwxyz0123456789', trunc(random() * 36)::integer + 1, 1) ||
  substr('abcdefghijklmnopqrstuvwxyz0123456789', trunc(random() * 36)::integer + 1, 1) ||
  substr('abcdefghijklmnopqrstuvwxyz0123456789', trunc(random() * 36)::integer + 1, 1)     
    as rnd_string
)


select distinct id, rnd_string from (

select t1.id, t1.rnd_string, levenshtein(t1.rnd_string, t2.rnd_string)
from t  t1
join t  t2 on t1.id < t2.id
where levenshtein(t1.rnd_string, t2.rnd_string) > 3
) x

order by id

With 200 IDs it filters only one or two strings from the list, but this will grow with more records.

Related questions:

*at least two different characters*? What md5input is going to return just one character? — Evan Carroll, Apr 11 '17 at 14:52
@EvanCarroll If I will generate 20 million unique strings with length of 7, I would like to make sure that there are not any two such string that differ in only one character or two characters that are adjacent. — Tomas Greif, Apr 11 '17 at 15:17
This would be a great [tag:Prolog] question using [tag:clpfd] (Constraint Logic Programming over Finite Domains); to bad you are limiting it to [tag:postgresql] — Guy Coder, Apr 11 '17 at 15:25
I don't know yet if the patterns can be defined using a regular grammar and thus defined using a regular expression. If so then a tool that can generate patterns from a regular expression should work. As I normally do this with Prolog and most people don't want to learn Prolog I found [regldg](https://regldg.com/), but have never used it. A related and interesting question would be is the grammar regular, e.g. Chomsky hierarchy [type-3](https://en.wikipedia.org/wiki/Chomsky_hierarchy#Type-3_grammars) or another Chomsky hierarchy type. — Guy Coder, Apr 11 '17 at 15:36
Can you generate a table with all this data and just query it at will? I've got several algorighms that come very close, however I believe even your lechenstein algorithm is flawed (there may be records that do not meet your requirements)-- assuming your requirements mean that ALL generated strings must differentiate from all other generated strings by 2 non-consecutive characters or more. — Joe Love, Apr 11 '17 at 18:14
Do you want to: a) generate 20 million string at once, b) generate the nth string for the nth user, c) generate the string that comes after a previously generated string x, or d) generate a string at random and check whether it differs enough from your already generated list of strings? — m69's been on strike for years, Apr 11 '17 at 20:56
I will distribute the codes to customers. As I have these customers already in the database, I can generate the codes at once (in batch). There is however a possibility that I will need to add another batch of codes to the already existing one in the future. @JoeLove - yes, I can generate the table upfront and query whenever I want. — Tomas Greif, Apr 11 '17 at 21:04

m69's been on strike for years · Accepted Answer · 2017-04-19T16:13:21.217

There are 78,364,164,096 seven-digit numbers in base-36. If you run through them with a step-size of 101 in base-36 (or 1297 in decimal), you still have 60,419,555 different numbers which will all be different in at least 2 non-consecutive places.

aaaaaaa
aaaabab
aaaacac
...
aaaa9a9
aaababa
aaabbbb
aaabcbc
aaabdbd
...
aaab9b9
aaacaca
aaacbcb
aaacccc
...
aaa9999
aababaa
aabacab
aabadac
aabaead
...

If you think the pattern is too obvious, take 20 million consecutive numbers starting somewhere halfway through the range, so that nobody has a number beginning with "aaaa", or use a different step-size, like 107 (see update below).

About step-sizes other than 101₃₆ (1297):

Digits in base-36 have these periods:

digit:    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
period:  36  18  12   9  36   6  36   9   4  18  36   3  36  18  12   9  36   2

digit:   19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
period:  36   9  12  18  36   3  36  18   4   9  36   6  36   9  12  18  36

When choosing a step-size of the shape X0Y₃₆, the smallest digit Y should never have a period that is less than the period of the third digit X; e.g. the step-size 106₃₆ (1302) would be a bad choice, because after six steps you'd get a difference only in two adjacent digits:

aaaaaaa, aaaabag, aaaacam, aaaadas, aaaaeay, aaaafa4, aaaagba

If you want to have at least 20 million results, the maximum step-size is 3.0.29₃₆ (3917), so you get these options:

1.0.1  (1297)  1.0.5  (1301)  1.0.7  (1303)  1.0.11 (1307)  1.0.13 (1309)  1.0.17 (1313)  
1.0.19 (1315)  1.0.23 (1319)  1.0.25 (1321)  1.0.29 (1325)  1.0.31 (1327)  1.0.35 (1331)  
2.0.1  (2593)  2.0.2  (2594)  2.0.5  (2597)  2.0.7  (2599)  2.0.10 (2602)  2.0.11 (2603)  
2.0.13 (2605)  2.0.14 (0606)  2.0.17 (2609)  2.0.19 (2611)  2.0.22 (2614)  2.0.23 (2615)  
2.0.25 (2617)  2.0.26 (2618)  2.0.29 (2621)  2.0.31 (2623)  2.0.34 (2626)  2.0.35 (2627)  
3.0.1  (3889)  3.0.2  (3890)  3.0.3  (3891)  3.0.5  (3893)  3.0.7  (3895)  3.0.10 (3898)  
3.0.11 (3899)  3.0.13 (3901)  3.0.14 (3902)  3.0.15 (3903)  3.0.17 (3905)  3.0.19 (3907)  
3.0.21 (3909)  3.0.22 (3910)  3.0.23 (3911)  3.0.25 (3913)  3.0.26 (3914)  3.0.29 (3917)

Additionally, the second digit can be given a value different from zero, e.g. 1.5.7₃₆ (1483) or 1.25.7₃₆ (2203) in order to further obfuscate the pattern.

An easy way to further increase perceived randomness is to shuffle the alphabet. The example below brings all the ideas together: it uses a shuffled alphabet, a step-size of 1.13.25₃₆ (1789) and starts at 0.1.2.3.4.5.6₃₆. Run the code snippet to see the first 10,000 of more than 40 million strings.

var alphabet = "nes2jf7tkd4ha6grlz9qm0bxp8w1ovi3u5cy";
var base = 36;
var digits = 7;
var step = 1789;                                // 1.13.25 (base-36)
var number = step * Math.ceil(63970746 / step); // skip to 0.1.2.3.4.5.6 (base-36)
for (var i = 0; i < 10000; i++) {
    var n = number;
    var str = "";
    for (var j = 0; j < digits; j++) {
        var d = n % base;
        n = (n - d) / base;
        str = alphabet.charAt(d) + str;
    }
    document.write(("nnnnnn" + str).substr(-digits) + ", "); // pad with zero character
    number += step;
}

Since the method is reversible, and you can convert every string back to a number which should be a multiple of the step-size, you can use this as a simple first check to see whether id strings entered in e.g. online forms are valid.

score 1 · Answer 2 · answered Apr 11 '17 at 22:04

To me that sounds very much like Hamming Code. However you still have to ensure that the generated base words do not collide. on a real random device you have to check for equality. Instead of checking you could also try to find a deterministic pseudo random sequence.

score 1 · Answer 3 · answered Apr 12 '17 at 18:30

CREATE OR REPLACE FUNCTION number_to_base(num BIGINT, base INTEGER)
  RETURNS TEXT
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
WITH RECURSIVE n(i, n, r) AS (
    SELECT -1, num, 0
  UNION ALL
    SELECT i + 1, n / base, (n % base)::INT
    FROM n
    WHERE n > 0
)
SELECT string_agg(ch, '')
FROM (
  SELECT CASE
           WHEN r=0 then 'z'
           WHEN r=1 then 'q'
           WHEN r=2 then 'w'
           WHEN r=3 then 'k'
           WHEN r=4 then 's'
           WHEN r=5 then 'g'
           WHEN r=6 then 'v'
           WHEN r=7 then '2'
           WHEN r=8 then '7'
           WHEN r=9 then 'l'
           WHEN r=10 then 'b'
           WHEN r=11 then 'p'
           WHEN r=12 then 'n'
           WHEN r=13 then 'h'
           WHEN r=14 then '1'
           WHEN r=15 then '3'
           WHEN r=16 then 'm'
           WHEN r=17 then 'o'
           WHEN r=18 then 'e'
           WHEN r=19 then 'u'
           WHEN r=20 then 'r'
           WHEN r=21 then 'i'
           WHEN r=22 then '4'
           WHEN r=23 then 'j'
           WHEN r=24 then 'y'
           WHEN r=25 then '0'
           WHEN r=26 then 'd'
           WHEN r=27 then 'x'
           WHEN r=28 then 'f'
           WHEN r=29 then '9'
           WHEN r=30 then '5'
           WHEN r=31 then '8'
           WHEN r=32 then '6'
           WHEN r=33 then 't'
           WHEN r=34 then 'c'
           WHEN r=35 then 'a'
           WHEN r=36 then 'C'
           WHEN r=37 then 'E'
           WHEN r=38 then 'Z'          
           WHEN r=39 then 'H'
           WHEN r=40 then 'Y'
           WHEN r=41 then 'I'
           WHEN r=42 then 'W'
           WHEN r=43 then 'Q'
           WHEN r=44 then 'M'
           WHEN r=45 then 'L'
           WHEN r=46 then 'P'
           WHEN r=47 then 'O'
           WHEN r=48 then 'K'
           WHEN r=49 then 'X'
           WHEN r=50 then 'S'
           WHEN r=51 then 'A'
           WHEN r=52 then 'U'
           WHEN r=53 then 'R'
           WHEN r=54 then 'V'
           WHEN r=55 then 'G'
           WHEN r=56 then 'B'
           WHEN r=57 then 'D'
           WHEN r=58 then 'N'
           WHEN r=59 then 'J'
           WHEN r=60 then 'F'
           WHEN r=61 then 'T'
           ELSE '%'
         END ch
  FROM n
  WHERE i >= 0
  ORDER BY i DESC
) ch
$function$;


select 
number_to_base((((gs % 
9)+1)::text||reverse(((gs*107)+20000000)::text))::bigint,62), gs
from generate_series(1,1000) gs

The 107 is used to create 2 character differences. The 20 million hard coded in is used to make sure you have 20 Mil codes that all have the same number of digits. The other features are used to help make the data appear pseudo random, however the procedure is totally able to be reversed using the data, or re-generated based on the ID.

I limited the results to 1000 simply for testings sake. It's much longer, but more random looking than other solutions. Probably a more elegant way, but this is fast and repeatable.

Unique string with at least two differences

3 Answers3