1

I recently bought myself a domain for personal URL shortening. And I created a function to generate alphanumeric strings of 4 characters as reference.

BUT

How do I check if they are already used or not? I can't check for every URL if it exists in the database, or is this just the way it works and I have to do it? If so, what if I have 13.000.000 URLs generated (out of 14.776.336). Do I need to keep generating strings until I have found one that is not in the DB yet?

This just doesn't look the right way to do it, anyone who can give me some advise?

Guido Gautier
  • 1,237
  • 9
  • 13
wouterds
  • 881
  • 10
  • 26
  • 4
    Do your strings have to be random, or could you just use `aaaa,aaab,aaac,etc`? Then all you have to do is fetch the most recent and increment it. – shanethehat Apr 19 '12 at 12:15
  • They have to be random and not guessable. – wouterds Apr 19 '12 at 13:23
  • If they're only 4 characters long, that's not possible - there are so few 4 character strings that once you've used up more than a few, an attacker can simply guess at random and have a fair chance of finding one soon. – Nick Johnson Apr 20 '12 at 00:32
  • @NickJohnson That's not true. You can generate 14.776.336 different alphanumeric strings with 4 characters. – wouterds Apr 20 '12 at 06:02
  • And if you worried about having database 90% full, attacker has a 90% chance of guessing valid string on first try. Really, to make this strings unguessable you have to keep load factor VERY low and finding empty string will not be an issue. – blaze Apr 20 '12 at 07:48
  • Please don't use a signature or tagline. Take a look at the [FAQ](http://stackoverflow.com/faq#signatures) – Guido Gautier Apr 20 '12 at 08:46
  • @WouterDS Right, and once your namespace is 1% full (~140k identifiers), the attacker will only have to guess an average of 100 records before they find one. Would you trust a 4 character password? Same problem. – Nick Johnson Apr 22 '12 at 00:21
  • @NickJohnson well they can try to find urls by putting random chars behind the url but I don't want them to know the next and the previous url.. Like if it is aaa5 logical next one would be aaa6 and the previous one would be aaa4.. That's the only thing I don't want them to know. – wouterds Apr 22 '12 at 09:32
  • @WouterDS Now you're changing the rules. Why on earth would that matter, though? – Nick Johnson Apr 23 '12 at 00:14

3 Answers3

3

One memory efficient and faster way I think of is following. This problem can be solved without use of database at all. The idea is that instead of storing used urls in database, you can store them in memory. And since storing them in memory can take a lot of memory usage, so we will use a bit set (an array of bits ) and we only one bit for each url.

  1. For each random string you generate, create a hashcode for that that lies b/w 0 and max number K.
  2. Create a bit set( basically a bit array). Whenever you use some url, set corresponding hash code bit in bit set to 1.
  3. Whenever you generate a new url, see if its hashcode bit is set. If yes, then discard that url and generate a new one. Repeat the process till you get one unused one.

This way you avoid DB forever, your lookups are extremely fast and it takes least amount of memory.

I borrowed the idea from this place

MoveFast
  • 3,011
  • 2
  • 27
  • 53
  • It's a nice principle but you don't need a hash, you can convert the string directly into a number, so that there are no collisions. – biziclop Apr 19 '12 at 12:58
  • I don't really understand it.. So are you saying that I need to generate and insert 14.7 million (62^4) random alphanumeric strings into the database and set a field to true when the code is used? – wouterds Apr 19 '12 at 13:22
  • @WouterDS i have added some more details...basically you can totally get away with database. – MoveFast Apr 19 '12 at 15:12
  • It's not really memory efficient when you're using O(n) bits of memory. – Nick Johnson Apr 20 '12 at 00:33
  • @NickJohnson Using this scheme for say 1million urs, it will take less than 1kB of memory...I hope this every one can manage – MoveFast Apr 20 '12 at 04:49
0

A compromise solution is to generate a random id, and if it is already in the database, find the first empty id that is bigger than it. (Wrapping around if you can't find any empty space in the range above.)

If you don't want the ids to be unguessable (you probably don't if you only use 4 characters), this approach works fine and is quick.

biziclop
  • 48,926
  • 12
  • 77
  • 104
  • Well I don't want the URLs to be guessable. Otherwise I would just use dechex().. – wouterds Apr 19 '12 at 13:18
  • By guessable I meant "finding a valid id by brute force", not finding the id for a specific url. Either way, this algorithm is easy (a single SQL query before insert) and gives you fairly random results. And it's guaranteed to fill in the entire code space without getting really slow. – biziclop Apr 19 '12 at 13:32
0

One algorithm is to try a few times to find a free url of N characters, if still not found, increase N. Start with N=4.

maniek
  • 7,087
  • 2
  • 20
  • 43