0

I've searched around for a while now on how to generate a shortened url (e.g. how bit.ly or goo.gl work) but have been unsuccessful.

I presumed it would be something like:

baseN(hash(long_url))

But I always end up with a very long digest instead of something short like 6 characters.

Is it safe to just truncate the digest before encoding it (is encoding it even necessary - I believe it is for making it URL 'safe' but wanted to ask) and is there not a possibility of collisions when only dealing with six characters?

It seems like (warning: I don't know maths) a factorial of 6! (e.g. 6*5*4*3*2*1) would result in only 720 combinations.

I also remember reading somewhere that with a hash table of 100k items, that a rough calculation for the number of collisions could yield ~17% chance of collision. That feels like a pretty large percentage to me?

The following Python code is based off my understanding of how I might do this type of url shortening:

import hashlib, base64

message = hashlib.sha512()
message.update("https://www.python.org/dev/peps/pep-0537/")

base64.urlsafe_b64encode(
    message.hexdigest().encode("utf-8")
)[:6].decode("utf-8")
Jono20201
  • 3,215
  • 3
  • 20
  • 33
Integralist
  • 5,899
  • 5
  • 25
  • 42
  • 1
    No, it's not a factorial. It's an exponent. 62 different characters in 6 positions, is 62^6 is 56.800.235.584 possible unique strings. URL shorterners don't use hashing, they use random number generators and fast dupe checking. – Martijn Pieters Mar 20 '18 at 09:11
  • Of course it isn’t save truncating anything from such a hash ... that’s like truncating the last name and street address from every entry in the phone book, and still expecting “Adam” to refer to one single unique person afterwards … Get away from thinking “random unique identifier” would need to imply any hashing. – CBroe Mar 20 '18 at 09:16
  • 1
    To shorten something and then restore the original from the shortened version would be a *compression* algorithm. But those are not very effective on short strings like URLs when you're still confined to expressing the compressed value using a subset of ASCII. – deceze Mar 20 '18 at 09:46
  • Possible duplicate of [How to code a URL shortener?](https://stackoverflow.com/questions/742013/how-to-code-a-url-shortener) – Prune Mar 20 '18 at 20:43

2 Answers2

3

There is no effective function to do this. You need to:

  1. Store the URL in a database
  2. Generate a unique ID (or if you already have the url, reuse the id)
deceze
  • 510,633
  • 85
  • 743
  • 889
0

you may be looking for a bidirectional function as mentioned in How to code a URL shortener?

but I also recommend you to not over-complicate unless it is really a requirement for your scenario

a much simpler approach would be to just keep record of what you've mapped:

... there is no compression algorithm, but there is a lookup and generation algorithm. When a URL shortener gets a new URL, it has to create a new short URL that is not yet taken and return this. It will then store the short URL and the long URL in a key-value store and use this at lookup time.

https://www.quora.com/What-are-the-http-bit-ly-and-t-co-shortening-algorithms

arhak
  • 2,488
  • 1
  • 24
  • 38