13

Upon creating an instance of a given ActiveRecord model object, I need to generate a shortish (6-8 characters) unique string to use as an identifier in URLs, in the style of Instagram's photo URLs (like http://instagram.com/p/P541i4ErdL/, which I just scrambled to be a 404) or Youtube's video URLs (like http://www.youtube.com/watch?v=oHg5SJYRHA0).

What's the best way to go about doing this? Is it easiest to just create a random string repeatedly until it's unique? Is there a way to hash/shuffle the integer id in such a way that users can't hack the URL by changing one character (like I did with the 404'd Instagram link above) and end up at a new record?

Community
  • 1
  • 1
kevboh
  • 5,207
  • 5
  • 38
  • 54
  • May I ask why you want to do this in this way? As more and more records are added the probability of hitting a model instance increases because the random string is so short and it would be trivial and only take seconds to generate and try to access thousands of urls. If you really want security this way you're going to need a long string. – ian Sep 25 '12 at 08:29
  • I want to offer permalinks to resources without exposing their numeric IDs. I can use a longer string if necessary, I just didn't want to use enormous hashes. – kevboh Sep 25 '12 at 13:17
  • It's not a Bad Thing™ to expose an ID (for a permalink) as long as you're not giving access rights based solely on having the key. If you do a check to see if the user should be able to read/write too it'll be ok. Then it's up to you whether you expose the ID or not, or use a hash or not. If you are giving access based solely on a random string in the url, then it needs to be loooong. – ian Sep 25 '12 at 13:23
  • If that's the case then why do Youtube/Instagram/etc generate this style of permalink? I'm really shooting for something like that. – kevboh Sep 25 '12 at 13:57
  • 2
    That's a good question for you to ask on http://security.stackexchange.com/. It's certainly not to protect a resource (as you proved with Instagram). Hashing is not inherently secure, I don't know why you'd think it made your site secure in any way? The reality is you're protecting a publicly accessible resource using an 8 character code. All anyone has to do is generate 8 char codes and do HEAD connects till they get a 200 status back - how long do you think that will take? If it's your only protection it is almost no protection at all. – ian Sep 25 '12 at 19:21
  • 1
    That's a really good point. I'll open up a question there to get an idea of the why for all this. – kevboh Sep 25 '12 at 21:44
  • 1
    http://security.stackexchange.com/questions/20718/what-is-the-purpose-of-using-youtube-style-alphanumeric-ids-as-url-slugs – kevboh Sep 26 '12 at 15:03

3 Answers3

25

Here's a good method with no collision already implemented in plpgsql.

First step: consider the pseudo_encrypt function from the PG wiki. This function takes a 32 bits integer as argument and returns a 32 bits integer that looks random to the human eye but uniquely corresponds to its argument (so that's encryption, not hashing). Inside the function, you may change the formula: (((1366.0 * r1 + 150889) % 714025) / 714025.0) with another function known only by you that produces a result in the [0..1] range (just tweaking the constants will probably be good enough, see below my attempt at doing just that). Refer to the wikipedia article on the Feistel cypher for more theorical explanations.

Second step: encode the output number in the alphabet of your choice. Here's a function that does it in base 62 with all alphanumeric characters.

CREATE OR REPLACE FUNCTION stringify_bigint(n bigint) RETURNS text
    LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
 alphabet text:='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
 base int:=length(alphabet); 
 _n bigint:=abs(n);
 output text:='';
BEGIN
 LOOP
   output := output || substr(alphabet, 1+(_n%base)::int, 1);
   _n := _n / base; 
   EXIT WHEN _n=0;
 END LOOP;
 RETURN output;
END $$

Now here's what we'd get for the first 10 URLs corresponding to a monotonic sequence:

select stringify_bigint(pseudo_encrypt(i)) from generate_series(1,10) as i;
 stringify_bigint 
------------------
 tWJbwb
 eDUHNb
 0k3W4b
 w9dtmc
 wWoCi
 2hVQz
 PyOoR
 cjzW8
 bIGoqb
 A5tDHb

The results look random and are guaranteed to be unique in the entire output space (2^32 or about 4 billion values if you use the entire input space with negative integers as well). If 4 billion values was not wide enough, you may carefully combine two 32 bits results to get to 64 bits while not loosing unicity in outputs. The tricky parts are dealing correctly with the sign bit and avoiding overflows.

About modifying the function to generate your own unique results: let's change the constant from 1366.0 to 1367.0 in the function body, and retry the test above. See how the results are completely different:

 NprBxb
 sY38Ob
 urrF6b
 OjKVnc
 vdS7j
 uEfEB
 3zuaT
 0fjsab
 j7OYrb
 PYiwJb

Update: For those who can compile a C extension, a good replacement for pseudo_encrypt() is range_encrypt_element() from the permuteseq extension, which has of the following advantages:

  • works with any output space up to 64 bits, and it doesn't have to be a power of 2.

  • uses a secret 64-bit key for unguessable sequences.

  • is much faster, if that matters.

Community
  • 1
  • 1
Daniel Vérité
  • 58,074
  • 15
  • 129
  • 156
  • This is awesome! I'm going to go with a ruby solution for now, but I may move to this in the future. Thanks. – kevboh Sep 26 '12 at 11:44
  • @daniel-verite: Great answer! Works for me except for one small thing: I need to work with bigints. I saw that you wrote the pseudo_encrypt() function yourself: can you tell me how to rewrite it, to allow bigint input instead of int (bigint->bigint instead of int->bigint)? – snøreven Oct 06 '12 at 09:34
  • 1
    Consider the `r1` and `l1` 16 bits blocks inside the function. You want to make them 32 bits instead of 16 and adapt the output accordingly (2x32=>64). If this hint is not sufficient, I'd suggest to create a new question on its own with your version of the code to look at, the comments sections being too small for that. – Daniel Vérité Oct 06 '12 at 12:04
  • Thank you! I'm absolutely not sure: Could you check this? http://stackoverflow.com/questions/12761346/pseudo-encrypt-function-in-plpgsql-that-takes-bigint – snøreven Oct 06 '12 at 16:15
  • This is exactly what I was looking for. I need to put this in the default value of the column. How would I do that? do I put `select stringify_bigint(pseudo_encrypt(i)) from generate_series(1,1) as i;` as the default? – Arya May 09 '19 at 06:49
  • @Arya: you want to create a sequence: `CREATE SEQUENCE myseq;` and use `stringify_bigint(pseudo_encrypt(nextval('myseq')))` as the default for your column. – Daniel Vérité May 09 '19 at 09:27
  • @DanielVérité from what I understand, I would get the same string id, if the sequence number is the same right? I am planning to use my database in a replicated setting, so I don't think this would work for me unfortunately, since sequences are not synced in logical replication. – Arya May 09 '19 at 16:52
  • @Arya: the replicated table uses the values from the source tables after they are generated. It doesn't need the sequence and doesn't care how the column values have been computed. – Daniel Vérité May 09 '19 at 17:29
  • @DanielVérité but if the master goes down, and the slave gets changed to master, then the sequences would cause issues – Arya May 09 '19 at 18:19
  • @Arya: to do a failover with a secondary, people use physical replication (i.e. a standby instance) not logical replication. If you had any auto-numbered primary key (think SERIAL), they would have the same issue, since they're based on sequences. – Daniel Vérité May 09 '19 at 19:16
  • @DanielVérité In the future Posgres will add multi-master logical replication. So I thought it's better not to reply on sequences to make the system future proof. – Arya May 09 '19 at 20:02
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/193117/discussion-between-arya-and-daniel-verite). – Arya May 09 '19 at 21:11
5

You could do something like this:

random_attribute.rb

module RandomAttribute

  def generate_unique_random_base64(attribute, n)
    until random_is_unique?(attribute)
      self.send(:"#{attribute}=", random_base64(n))
    end
  end

  def generate_unique_random_hex(attribute, n)
    until random_is_unique?(attribute)
      self.send(:"#{attribute}=", SecureRandom.hex(n/2))
    end
  end

  private

  def random_is_unique?(attribute)
    val = self.send(:"#{attribute}")
    val && !self.class.send(:"find_by_#{attribute}", val)
  end

  def random_base64(n)
    val = base64_url
    val += base64_url while val.length < n
    val.slice(0..(n-1))
  end

  def base64_url
    SecureRandom.base64(60).downcase.gsub(/\W/, '')
  end
end
Raw

user.rb

class Post < ActiveRecord::Base

  include RandomAttribute
  before_validation :generate_key, on: :create

  private

  def generate_key
    generate_unique_random_hex(:key, 32)
  end
end
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
Aaron Gibralter
  • 4,773
  • 3
  • 35
  • 50
  • This is pretty cool. Couldn't I drop the `downcase` to get more strings in a given size space? – kevboh Sep 25 '12 at 23:50
  • @kevboh yeah definitely. I mean the whole thing is a bit hacky but works... for some reason I didn't want the keys to be case-sensitive. – Aaron Gibralter Sep 26 '12 at 04:29
  • I think I'll go with this for now and examine a more complete, collisionless option later. Thanks. – kevboh Sep 26 '12 at 11:44
0

You can hash the id:

Digest::MD5.hexdigest('1')[0..9]
=> "c4ca4238a0"
Digest::MD5.hexdigest('2')[0..9]
=> "c81e728d9d"

But somebody can still guess what you're doing and iterate that way. It's probably better to hash on the content

pguardiario
  • 53,827
  • 19
  • 119
  • 159
  • Hmm... wouldn't this result in a bunch of collisions, though? – kevboh Sep 25 '12 at 13:17
  • @kevboh yeah, but you just have to check for collisions before saving. There's no way you're going to avoid collisions with short hashes... – Aaron Gibralter Sep 25 '12 at 16:26
  • @Aaron Gilbalter - The way instagram does it (not md5) it's 62^10. I don't imagine they worry about collisions with numbers like that :) Even at 16^10 it's ridiculously unlikely. – pguardiario Sep 25 '12 at 18:14
  • 1
    @pguardiario so how do they do it? – kevboh Sep 25 '12 at 18:53
  • doing a simple hash defeats the point. However, you could do a simple hash with a salt which is pretty good in this use case: md5(somesalt[0-9]+) - if you use a decent salt then you're good to go. – John Hunt Jul 21 '16 at 14:46