I've searched around for a while now on how to generate a shortened url (e.g. how bit.ly or goo.gl work) but have been unsuccessful.
I presumed it would be something like:
baseN(hash(long_url))
But I always end up with a very long digest instead of something short like 6 characters.
Is it safe to just truncate the digest before encoding it (is encoding it even necessary - I believe it is for making it URL 'safe' but wanted to ask) and is there not a possibility of collisions when only dealing with six characters?
It seems like (warning: I don't know maths) a factorial of 6! (e.g. 6*5*4*3*2*1
) would result in only 720 combinations.
I also remember reading somewhere that with a hash table of 100k items, that a rough calculation for the number of collisions could yield ~17% chance of collision. That feels like a pretty large percentage to me?
The following Python code is based off my understanding of how I might do this type of url shortening:
import hashlib, base64
message = hashlib.sha512()
message.update("https://www.python.org/dev/peps/pep-0537/")
base64.urlsafe_b64encode(
message.hexdigest().encode("utf-8")
)[:6].decode("utf-8")