1

I have a table with a varchar column named key, which is supposed to hold a unique, 8-char random string, which is going to be used as an unique identifier by users. This field should be generated and saved on creation of objects, I have a question about how to create it:

Most of recommendations point to UUID field, but it's not applicable for me because it's too long, and if just get a subset of it then there's no guarantee of uniqueness.

Currently I've just implemented a loop in my backend (not DB), which generates a random string and tries to insert it to DB, and retries if the string turns out to be not unique. But I feel that this is just a really bad practice.

What's the best way to do this?

I'm using Postgresql 9.6

UPDATE: My main concern is to remove the loop that retries to find a random, short string (or number, doesn't matter) that is unique in that table. AFAIK the solution should be a way to generate the string in DB itself. The only thing that I can find for Postgresql is uuid and uuid-ossp that does something like this, but uuid is way too long for my application, and I don't know of any way to have a shorter representation of uuid without compromising it's uniqueness (and I don't think it's possible theoretically).
So, how can I remove the loop and it's back-and-forth to DB?

sazary
  • 906
  • 1
  • 9
  • 20
  • 1
    The way I've done it in the past is by combining the users I'd with current timestamp or some other string and hashing it. This way it will always be unique and no one will know the users I'd (not that it really matters) – Michael Mano Jul 02 '19 at 07:38
  • 1
    You are looking for a collision-free hash function, which is complicated and probably slow to calculate. Just use an `uuid` or a sequence; an artificial key is preferable in this case. I wonder why you think that a 16-byte `uuid` is too long. Don't forget - the shorter, the more likely collisions are, and the more complicated the algorithm would have to be. Don't go into premature optimization, that is one of the sources of evil. – Laurenz Albe Jul 02 '19 at 07:53
  • 1
    An actual *hash* value is *not* random but deterministic, derived from input. So which is it in your case? Random or hash? (Derived from what input exactly?) – Erwin Brandstetter Jul 02 '19 at 08:13
  • @MichaelMano all of the hashing algorithms that I know produce hashes that are longer that what I need, and just simple substring of them don't have the uniqueness guarantee, am I right? – sazary Jul 02 '19 at 10:16
  • @PeterO. yes users will see it and use it in their communications. Auto-increments are not applicable because they're predictable and expose too much info (how many rows and...). I read the link you mentioned, thank you for that, but AFAIK it's generating it in backend, and it essentially involves the bad practice that I already have (generating, checking for uniqueness, generating again...), am I right? – sazary Jul 02 '19 at 10:20
  • @LaurenzAlbe users will see it, and they need it to communicate with each other, so it has to be short, unique, and random. So uuid is not applicable, unless there is a way that we could shorten the uuid without comprising it's uniqueness, which I think is impossible – sazary Jul 02 '19 at 10:22
  • @ErwinBrandstetter thanks for pointing it out; it's just random – sazary Jul 02 '19 at 10:23
  • 1
    @sazary: If it's just random, why not use a plain `bigserial` (8 bytes) instead? Dead simple and reliable. – Erwin Brandstetter Jul 02 '19 at 10:25
  • @ErwinBrandstetter data format isn't a concern, the concern is how to generate a unique, random, short ID without the loop in the backend I described in my question. Does using bigserial help with it? – sazary Jul 02 '19 at 10:28
  • 1
    A `bigserial` (or `IDENTITY`) column provides a unique number per row automatically. Not exactly random, as the name suggests. But perfectly dealing with all possible complications from concurrent writes. See: https://stackoverflow.com/a/9875517/939860 – Erwin Brandstetter Jul 02 '19 at 10:33
  • bigserial isn't random, is simply auto-incremented, so it's predictable and exposes too much info. I can't use an auto-incremented column (BTW I already have one such field in that table) for this reason. randomness and unpredictability are parts of requirement. – sazary Jul 02 '19 at 10:38
  • A UUID is only 16 byte long - why is that too long? –  Jul 02 '19 at 10:38
  • @a_horse_with_no_name it would be read and used by users, so the shorter the better, and there is a limit – sazary Jul 02 '19 at 10:43
  • @sazary: You might fix your question (and your column name) at this point, as it is not about a "hash" at all. It's about generating random 8-character strings without collision. Shortened UUIDs, basically, with are hardly "universally unique" any more due to the reduced key space. – Erwin Brandstetter Jul 02 '19 at 10:45
  • @ErwinBrandstetter thanks, did it. Now, how can I have those shortened UUIDs generated in DB itself? – sazary Jul 02 '19 at 10:58
  • Important but missing defining bits: Are there concurrent writes? Range of characters in the string? Are just hex digits ok? Size of table? Cardinality? Performance requirements? Can you install additional modules? – Erwin Brandstetter Jul 04 '19 at 22:23

1 Answers1

0

Encryption is guaranteed unique, it has to be otherwise decryption would not work. Provided you encrypt unique inputs, such as 0, 1, 2, 3, ... then you are guaranteed unique outputs.

You want 8 characters. You have 62 characters to play with: A-Z, a-z, 0-9 so convert your binary output from the encryption to a base 62 number.

You may need to use the cycle walking technique from Format-preserving encryption to handle a few cases.

rossum
  • 15,344
  • 1
  • 24
  • 38
  • how does converting it to base 62 shorten it without compromising it's uniqueness? I don't think it's theoretically possible – sazary Jul 02 '19 at 10:25
  • Most encryptions will give you a binary number. Just convert that number from base 2 to base 62. Base conversions are a standard mathematical process, and are even built into some programming languages. The number remains unique, merely its expression changes. 0b1010 = 0xA = 16. – rossum Jul 02 '19 at 10:31
  • yes it remains unique if you don't shorten it with some kind of substringing, but I need the result to be less than 8 characters. I don't think this helps. – sazary Jul 02 '19 at 10:35
  • Use a 32 bit encryption, that will guarantee you less than 8 characters in base 63. Any value up to 218340105584896 comes in at less than 8 characters. – rossum Jul 02 '19 at 14:01