705

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef".

Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56~57 billion possible strings.

My approach:

I have a database table with three columns:

  1. id, integer, auto-increment
  2. long, string, the long URL the user entered
  3. short, string, the shortened URL (or just the six characters)

I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create too long strings. I don't use these algorithms, I think. A self-built algorithm will work, too.

My idea:

For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:

short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.

That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?

Due to the ongoing interest in this topic, I've published an efficient solution to GitHub, with implementations for JavaScript, PHP, Python and Java. Add your solutions if you like :)

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
caw
  • 30,999
  • 61
  • 181
  • 291
  • 5
    @gudge The point of those functions is that they have an inverse function. This means you can have both `encode()` and `decode()` functions. The steps are, therefore: (1) Save URL in database (2) Get unique row ID for that URL from database (3) Convert integer ID to short string with `encode()`, e.g. `273984` to `f5a4` (4) Use the short string (e.g. `f4a4`) in your sharable URLs (5) When receiving a request for a short string (e.g. `20a8`), decode the string to an integer ID with `decode()` (6) Look up URL in database for given ID. For conversion, use: https://github.com/delight-im/ShortURL – caw Feb 10 '15 at 10:31
  • @Marco, what's the point of storing the hash in the database? – Maksim Vi. Jul 11 '15 at 09:04
  • 3
    @MaksimVi. If you have an invertible function, there's none. If you had a one-way hash function, there would be one. – caw Jul 14 '15 at 14:47
  • 2
    would it be wrong if we used simple CRC32 algorithm to shorten a URL? Although very unlikely of a collision (a CRC32 output is usually 8 characters long and that gives us over 30 million possibilities) If a generated CRC32 output was already used previously and was found in the database, we could salt the long URL with a random number until we find a CRC32 output which is unique in my database. How bad or different or ugly would this be for a simple solution? – Rakib Mar 22 '16 at 09:41
  • [Typical number to short string conversion approach in Java](https://github.com/aniket91/DataStructures/blob/master/src/com/osfg/questions/ShortURLGenerator.java) – Aniket Thakur May 14 '16 at 09:41

30 Answers30

865

I would continue your "convert number to string" approach. However, you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.

Theoretical background

You need a Bijective Function f. This is necessary so that you can find a inverse function g('abc') = 123 for your f(123) = 'abc' function. This means:

  • There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2),
  • and for every y you must be able to find an x so that f(x) = y.

How to convert the ID to a shortened URL

  1. Think of an alphabet we want to use. In your case, that's [a-zA-Z0-9]. It contains 62 letters.
  2. Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).

    For this example, I will use 12510 (125 with a base of 10).

  3. Now you have to convert 12510 to X62 (base 62).

    12510 = 2×621 + 1×620 = [2,1]

    This requires the use of integer division and modulo. A pseudo-code example:

    digits = []
    
    while num > 0
      remainder = modulo(num, 62)
      digits.push(remainder)
      num = divide(num, 62)
    
    digits = digits.reverse
    

    Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array for example) could look like:

    0  → a
    1  → b
    ...
    25 → z
    ...
    52 → 0
    61 → 9
    

    With 2 → c and 1 → b, you will receive cb62 as the shortened URL.

    http://shor.ty/cb
    

How to resolve a shortened URL to the initial ID

The reverse is even easier. You just do a reverse lookup in your alphabet.

  1. e9a62 will be resolved to "4th, 61st, and 0th letter in the alphabet".

    e9a62 = [4,61,0] = 4×622 + 61×621 + 0×620 = 1915810

  2. Now find your database-record with WHERE id = 19158 and do the redirect.

Example implementations (provided by commenters)

Jay Taylor
  • 13,185
  • 11
  • 60
  • 85
Marcel Jackwerth
  • 53,948
  • 9
  • 74
  • 88
  • 20
    Don't forget to sanitize the URLs for malicious javascript code! Remember that javascript can be base64 encoded in a URL so just searching for 'javascript' isn't good enough.j – Bjorn Apr 14 '09 at 08:05
  • @apphacker: Could you shortly explain how to sanitize please? I thought it would be enough to use strip_tags() function in PHP. Or tell me if that cannot be explained shortly, then I'll post it as a new question here. – caw Apr 14 '09 at 18:20
  • 4
    A function must be bijective (injective *and* surjective) to have an inverse. – Gumbo May 04 '10 at 20:28
  • @Gumbo: You are right. An injective function only implies a left inverse, but no inverse. Fixed that. – Marcel Jackwerth May 04 '10 at 20:58
  • 57
    Food for thought, it might be useful to add a two character checksum to the url. That would prevent direct iteration of all the urls in your system. Something simple like f(checksum(id) % (62^2)) + f(id) = url_id – koblas Sep 04 '10 at 13:53
  • Will a bijective function generate a unique output for differing inputs? a ~> x,b ~> y, and c ~> z – Edward Apr 23 '13 at 16:12
  • @Edward Yes, because every bijective function is injective, and injective functions always generate different outputs for different inputs. See http://en.wikipedia.org/wiki/Injective_function for more info. – Marcel Jackwerth Apr 24 '13 at 08:45
  • 3
    Strictly speaking, there's no reason that you can't have the same long URL map to multiple short ones. E.g. when someone shortens a URL, there's no need to check to see if it's already in the table; just add it again. This saves having to do a lookup at creation time, at the cost of table size if lots of people are shortening the same URL. – Edward Falk May 26 '13 at 15:32
  • 7
    As far as sanitizing the urls go, one of the problems you're going to face is spammers using your service to mask their URLS to avoid spam filters. You need to either limit the service to known good actors, or apply spam filtering to the long urls. Otherwise you WILL be abused by spammers. – Edward Falk May 26 '13 at 15:34
  • 78
    Base62 may be a bad choice because it has the potential to generate f* words (for example, `3792586=='F_ck'` with u in the place of _). I would exclude some characters like u/U in order to minimize this. – Paulo Scardine Jun 28 '13 at 16:02
  • 2
    @MarcelJackwerth why can't we just send the auto increment id directly. Is it for security reason or something else? – Trying Jan 11 '14 at 00:27
  • @DanLee better to go with this Bijective class encode database ID to Bijective. and some proper math to hide that ID.. full uniqness – dev.meghraj Jan 22 '14 at 12:36
  • 1
    To somewhat answer @capdragon's question (not really - different algorithm, but it shows how it could be implemented), here is a naive C# class I wrote a couple years back as the innards for squrl.us (which I later abandoned): https://github.com/austegard/Squrl/blob/master/HumanBase32.cs It is based on Crockford's base32, and emphasizes readability and ease of communicating by humans (could be written down, or read to someone over the phone etc, with little chance of transcription errors). – Oskar Austegard Feb 13 '14 at 19:17
  • Mapping the id to base 62 as above means that it's very easy to increment/decrement the result and find more links. Can anyone explain @koblas solution and why that fixes the increment/decrement issue? – divide_by_zero Sep 03 '15 at 10:22
  • 1
    If you don't know the checksum() or f() functions for the URL (e.g. they incorporate a secret) then it would mean that you would have to iterate over 62^2 invalid urls to get one url. It then becomes easy to detect this abuse case. – koblas Sep 03 '15 at 18:23
  • I did not get it completely. So , finally solution requires storing original URL anywhere in data base or not ? – Andy897 May 02 '16 at 07:00
  • How would this really scale across multiple servers and databases? – p0lAris Jun 29 '16 at 22:09
  • 2
    Considering the original question, is there any reason you can't make a URL shortener with an injective function rather than requiring a bijective function? As long as you can generate a unique string for each number in the sequence, you could just store a map from the generated string to the long URL directly and increment the sequence for the next generated URL. Why would you need to be able to get the original sequence number from the string? – stuckj Aug 12 '16 at 13:58
  • NodeJS module which implements this algorithm backed by pluggable storage defaulting to Redis: https://www.npmjs.com/package/id-shortener – fisch2 Jun 04 '17 at 03:18
  • Implemented this in golang https://gist.github.com/kareemsabri/d902611eacf3ef659cd7db31123f4407 – Kareem Nov 02 '17 at 01:13
  • How do you handle duplicate urls ? Would a new url get generated for same long url if I hit again? – not-a-robot Apr 07 '18 at 10:14
  • @not-a-robot no even if you entered the same url again because the generated code depends mainly on the autoincremented id from the database so for every new url will be a new code generated – Yahia Bat Aug 25 '18 at 15:47
  • question: why not just use the autoincrement id directly? why even bother encoding it in a different base? – temporary_user_name Feb 26 '19 at 03:41
  • if you start from 0 or 1 and go up, there is a privacy issue because people can run through all your URL IDs and visit all the URLs in your DB. Your users might not want other people to be able to visit such a page – nonopolarity Dec 10 '19 at 07:54
  • 1
    I don't understand why this approach, encoding an auto incremented id, is better than just using Math.random to generate a 7 character string. This approach is way more complicated and provides little to no benefit. It's also a security risk because now the URL is potentially reverse engineerable. – Sean Montana Jan 16 '20 at 17:50
  • @sean-montana Me neither. Maybe most of people want their keys be reversable, or they haven't learned random function yet. – Weihang Jian Jan 06 '21 at 18:04
  • @WeihangJian If you generated a random 7 character string, you'd have to check the database to make sure the string is not already there. If it is, you'd have to create another random string, check again...repeat. – GingerBreadMane Jan 22 '21 at 04:05
  • 1
    @stuckj You could create a URL shortener with an injective function but it would be less efficient. With an injective function we'd have to make two database calls, one initial save to get the auto-incremented ID, and then an update to save the unique string that we create from the auto-incremented ID. The bijective function allows us to skip saving the unique string at all since we can always resolve it to an ID. – GingerBreadMane Jan 22 '21 at 04:09
  • Since this algorithm uses base-x conversion, generated links will never start with the first character in the alphabet. This eliminates a total of x^n-1 + x^n-2 + ... + x^1 + x^0 possible combinations of shortened strings where x is the number of characters in the alphabet and n is the maximum number of characters in the generated string. – GingerBreadMane Jan 22 '21 at 04:17
  • Found this Javascript implementation in npm https://github.com/niieani/hashids.js – esamatti May 09 '21 at 09:42
57

Why would you want to use a hash?

You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say you character space (A-Z, a-z, 0-9, etc.) has 62 characters, convert the id to a base-40 number and use the characters as the digits.

Neeraj Jain
  • 7,643
  • 6
  • 34
  • 62
shoosh
  • 76,898
  • 55
  • 205
  • 325
  • 14
    asides from the fact that A-Z, a-z and 0-9 = 62 chars, not 40, you are right on the mark. – Evan Teran Apr 12 '09 at 16:39
  • Thanks! Should I use the base-62 alphabet then? http://en.wikipedia.org/wiki/Base_62 But how can I convert the ids to a base-62 number? – caw Apr 12 '09 at 16:46
  • Using a base conversion algorithm ofcourse - http://en.wikipedia.org/wiki/Base_conversion#Change_of_radix – shoosh Apr 12 '09 at 16:48
  • Thank you! That's really simple. :) Do I have to do this until the dividend is 0? Will the dividend always be 0 at some point? – caw Apr 12 '09 at 17:04
  • yes, if you keep dividing an integer n by k, it will reach 0 after a maximum of log_k(n)+1 divisions (try to prove this) – shoosh Apr 12 '09 at 17:34
  • 3
    Regarding "Why would you want to use a hash?", a base conversion based on auto-increment is going to create sequential URLs, so you'd need to be comfortable with people being able to "browse" other peoples' shortened URLs, right? – Andrew Coleson Apr 12 '09 at 18:05
  • Yes, but that's no problem, is it? I don't want to encrypt the URLs but I want to shorten them. – caw Apr 12 '09 at 18:17
  • 4
    with enough resources and time you can "browse" all the URLs of of any URL shortening service. – shoosh Apr 12 '09 at 21:10
  • Yeah. That can be a problem if you don't want people to just scrape the contents of your entire url list. – Agamemnus Mar 07 '18 at 23:13
52
public class UrlShortener {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int    BASE     = ALPHABET.length();

    public static String encode(int num) {
        StringBuilder sb = new StringBuilder();
        while ( num > 0 ) {
            sb.append( ALPHABET.charAt( num % BASE ) );
            num /= BASE;
        }
        return sb.reverse().toString();   
    }

    public static int decode(String str) {
        int num = 0;
        for ( int i = 0; i < str.length(); i++ )
            num = num * BASE + ALPHABET.indexOf(str.charAt(i));
        return num;
    }   
}
Boris
  • 4,944
  • 7
  • 36
  • 69
Stradivariuz
  • 2,523
  • 1
  • 18
  • 6
  • I really like the idea, the only problem i have with it is that i keep getting the num variable in the decode function out of bounds(even for long), do you have any idea how to make it work? or is it theoretical only? – user1322801 May 12 '16 at 19:07
  • @user1322801: Presumably you're trying to decode something that was far larger than what the encode function can actually handle. You could get some more mileage out of it if you converted all of the "ints" to BigInteger, but unless you've got > 9223372036854775807 indexes, long should probably be enough. – biggusjimmus Jul 19 '16 at 06:08
  • 3
    May I know what is the importance of reversing ? i.e. sb.reverse().toString(); – dotNet Decoder Jun 07 '17 at 02:00
  • It that 62^62 = 1.7 trillion? – Noah Tony Aug 24 '18 at 19:30
33

Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O and other characters very very similar that they are near impossible to tell the difference) and downright error prone. Try to use lower or upper case only.

Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies that show that people tend to remember one form better than others (think phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char. I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore useful.

Ash
  • 699
  • 1
  • 7
  • 6
  • 2
    Thank you, very good idea. I haven't thought about that yet. It's clear that it depends on the kind of use whether that makes sense or not. – caw Apr 12 '09 at 18:22
  • 19
    It won't be an issue if people are strictly copy-and-pasting the short urls. – Edward Falk May 26 '13 at 15:35
  • 2
    The purpose of short url's is not to be memorable or easy to speak. Is only click or copy/paste. – Hugo Nogueira Dec 05 '16 at 14:12
  • yeah I thought the short URL is only for people to list it or email it and so it is short and won't take up 200 characters like some URLs do, so case isn't a problem – nonopolarity Dec 10 '19 at 07:46
30

My approach: Take the Database ID, then Base36 Encode it. I would NOT use both Upper AND Lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base 62 en/decoder.

Michael Stum
  • 177,530
  • 117
  • 400
  • 535
  • Thanks, you're right. Whether you have 2,176,782,336 possibilities or 56,800,235,584, it's the same: Both will be enough. So I will use base 36 encoding. – caw Apr 14 '09 at 18:22
  • It may be obvious but here is some PHP code referenced in wikipedia to do base64 encode in php http://www.tonymarston.net/php-mysql/converter.html – Ryan White Jul 13 '10 at 15:33
8

Here is my PHP 5 class.

<?php
class Bijective
{
    public $dictionary = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public function __construct()
    {
        $this->dictionary = str_split($this->dictionary);
    }

    public function encode($i)
    {
        if ($i == 0)
        return $this->dictionary[0];

        $result = '';
        $base = count($this->dictionary);

        while ($i > 0)
        {
            $result[] = $this->dictionary[($i % $base)];
            $i = floor($i / $base);
        }

        $result = array_reverse($result);

        return join("", $result);
    }

    public function decode($input)
    {
        $i = 0;
        $base = count($this->dictionary);

        $input = str_split($input);

        foreach($input as $char)
        {
            $pos = array_search($char, $this->dictionary);

            $i = $i * $base + $pos;
        }

        return $i;
    }
}
Xeoncross
  • 55,620
  • 80
  • 262
  • 364
6

A Node.js and MongoDB solution

Since we know the format that MongoDB uses to create a new ObjectId with 12 bytes.

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 3-byte machine identifier,
  • a 2-byte process id
  • a 3-byte counter (in your machine), starting with a random value.

Example (I choose a random sequence) a1b2c3d4e5f6g7h8i9j1k2l3

  • a1b2c3d4 represents the seconds since the Unix epoch,
  • 4e5f6g7 represents machine identifier,
  • h8i9 represents process id
  • j1k2l3 represents the counter, starting with a random value.

Since the counter will be unique if we are storing the data in the same machine we can get it with no doubts that it will be duplicate.

So the short URL will be the counter and here is a code snippet assuming that your server is running properly.

const mongoose = require('mongoose');
const Schema = mongoose.Schema;

// Create a schema
const shortUrl = new Schema({
    long_url: { type: String, required: true },
    short_url: { type: String, required: true, unique: true },
  });
const ShortUrl = mongoose.model('ShortUrl', shortUrl);

// The user can request to get a short URL by providing a long URL using a form

app.post('/shorten', function(req ,res){
    // Create a new shortUrl */
    // The submit form has an input with longURL as its name attribute.
    const longUrl = req.body["longURL"];
    const newUrl = ShortUrl({
        long_url : longUrl,
        short_url : "",
    });
    const shortUrl = newUrl._id.toString().slice(-6);
    newUrl.short_url = shortUrl;
    console.log(newUrl);
    newUrl.save(function(err){
        console.log("the new URL is added");
    })
});
Firas Omrane
  • 894
  • 1
  • 14
  • 21
5

I keep incrementing an integer sequence per domain in the database and use Hashids to encode the integer into a URL path.

static hashids = Hashids(salt = "my app rocks", minSize = 6)

I ran a script to see how long it takes until it exhausts the character length. For six characters it can do 164,916,224 links and then goes up to seven characters. Bitly uses seven characters. Under five characters looks weird to me.

Hashids can decode the URL path back to a integer but a simpler solution is to use the entire short link sho.rt/ka8ds3 as a primary key.

Here is the full concept:

function addDomain(domain) {
    table("domains").insert("domain", domain, "seq", 0)
}

function addURL(domain, longURL) {
    seq = table("domains").where("domain = ?", domain).increment("seq")
    shortURL = domain + "/" + hashids.encode(seq)
    table("links").insert("short", shortURL, "long", longURL)
    return shortURL
}

// GET /:hashcode
function handleRequest(req, res) {
    shortURL = req.host + "/" + req.param("hashcode")
    longURL = table("links").where("short = ?", shortURL).get("long")
    res.redirect(301, longURL)
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
AJcodez
  • 31,780
  • 20
  • 84
  • 118
4

C# version:

public class UrlShortener 
{
    private static String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static int    BASE     = 62;

    public static String encode(int num)
    {
        StringBuilder sb = new StringBuilder();

        while ( num > 0 )
        {
            sb.Append( ALPHABET[( num % BASE )] );
            num /= BASE;
        }

        StringBuilder builder = new StringBuilder();
        for (int i = sb.Length - 1; i >= 0; i--)
        {
            builder.Append(sb[i]);
        }
        return builder.ToString(); 
    }

    public static int decode(String str)
    {
        int num = 0;

        for ( int i = 0, len = str.Length; i < len; i++ )
        {
            num = num * BASE + ALPHABET.IndexOf( str[(i)] ); 
        }

        return num;
    }   
}
user1477388
  • 20,790
  • 32
  • 144
  • 264
4

You could hash the entire URL, but if you just want to shorten the id, do as marcel suggested. I wrote this Python implementation:

https://gist.github.com/778542

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
bhelx
  • 1,107
  • 1
  • 6
  • 6
4

Take a look at https://hashids.org/ it is open source and in many languages.

Their page outlines some of the pitfalls of other approaches.

John
  • 29,788
  • 18
  • 89
  • 130
3

If you don't want re-invent the wheel ... http://lilurl.sourceforge.net/

Alister Bulman
  • 34,482
  • 9
  • 71
  • 110
3
// simple approach

$original_id = 56789;

$shortened_id = base_convert($original_id, 10, 36);

$un_shortened_id = base_convert($shortened_id, 36, 10);
phirschybar
  • 8,357
  • 12
  • 50
  • 66
2
alphabet = map(chr, range(97,123)+range(65,91)) + map(str,range(0,10))

def lookup(k, a=alphabet):
    if type(k) == int:
        return a[k]
    elif type(k) == str:
        return a.index(k)


def encode(i, a=alphabet):
    '''Takes an integer and returns it in the given base with mappings for upper/lower case letters and numbers 0-9.'''
    try:
        i = int(i)
    except Exception:
        raise TypeError("Input must be an integer.")

    def incode(i=i, p=1, a=a):
        # Here to protect p.                                                                                                                                                                                                                
        if i <= 61:
            return lookup(i)

        else:
            pval = pow(62,p)
            nval = i/pval
            remainder = i % pval
            if nval <= 61:
                return lookup(nval) + incode(i % pval)
            else:
                return incode(i, p+1)

    return incode()



def decode(s, a=alphabet):
    '''Takes a base 62 string in our alphabet and returns it in base10.'''
    try:
        s = str(s)
    except Exception:
        raise TypeError("Input must be a string.")

    return sum([lookup(i) * pow(62,p) for p,i in enumerate(list(reversed(s)))])a

Here's my version for whomever needs it.

MrChrisRodriguez
  • 493
  • 6
  • 12
1

Did you omit O, 0, and i on purpose?

I just created a PHP class based on Ryan's solution.

<?php

    $shorty = new App_Shorty();

    echo 'ID: ' . 1000;
    echo '<br/> Short link: ' . $shorty->encode(1000);
    echo '<br/> Decoded Short Link: ' . $shorty->decode($shorty->encode(1000));


    /**
     * A nice shorting class based on Ryan Charmley's suggestion see the link on Stack Overflow below.
     * @author Svetoslav Marinov (Slavi) | http://WebWeb.ca
     * @see http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/10386945#10386945
     */
    class App_Shorty {
        /**
         * Explicitly omitted: i, o, 1, 0 because they are confusing. Also use only lowercase ... as
         * dictating this over the phone might be tough.
         * @var string
         */
        private $dictionary = "abcdfghjklmnpqrstvwxyz23456789";
        private $dictionary_array = array();

        public function __construct() {
            $this->dictionary_array = str_split($this->dictionary);
        }

        /**
         * Gets ID and converts it into a string.
         * @param int $id
         */
        public function encode($id) {
            $str_id = '';
            $base = count($this->dictionary_array);

            while ($id > 0) {
                $rem = $id % $base;
                $id = ($id - $rem) / $base;
                $str_id .= $this->dictionary_array[$rem];
            }

            return $str_id;
        }

        /**
         * Converts /abc into an integer ID
         * @param string
         * @return int $id
         */
        public function decode($str_id) {
            $id = 0;
            $id_ar = str_split($str_id);
            $base = count($this->dictionary_array);

            for ($i = count($id_ar); $i > 0; $i--) {
                $id += array_search($id_ar[$i - 1], $this->dictionary_array) * pow($base, $i - 1);
            }
            return $id;
        }
    }
?>
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Svetoslav Marinov
  • 1,498
  • 14
  • 11
1
public class TinyUrl {
    
        private final String characterMap = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        private final int charBase = characterMap.length();
    
        public String covertToCharacter(int num){
            StringBuilder sb = new StringBuilder();
    
            while (num > 0){
                sb.append(characterMap.charAt(num % charBase));
                num /= charBase;
            }
    
            return sb.reverse().toString();
        }
    
        public int covertToInteger(String str){
            int num = 0;
            for(int i = 0 ; i< str.length(); i++)
                num += characterMap.indexOf(str.charAt(i)) * Math.pow(charBase , (str.length() - (i + 1)));
    
            return num;
        }
}
    
class TinyUrlTest{
    
    public static void main(String[] args) {
        TinyUrl tinyUrl = new TinyUrl();
        int num = 122312215;
        String url = tinyUrl.covertToCharacter(num);
        System.out.println("Tiny url:  " + url);
        System.out.println("Id: " + tinyUrl.covertToInteger(url));
    }
}
Hrishikesh Mishra
  • 3,295
  • 3
  • 27
  • 33
1

Why not just translate your id to a string? You just need a function that maps a digit between, say, 0 and 61 to a single letter (upper/lower case) or digit. Then apply this to create, say, 4-letter codes, and you've got 14.7 million URLs covered.

cr333
  • 1,555
  • 1
  • 15
  • 16
  • +1 for the simplistic thinking. It really is that simple. I just posted an answer that is doing exactly this. I have some production code that queries the database to ensure there are no duplicate strings and everything is unique. – Andrew Reese Nov 14 '19 at 18:10
1

Here is a decent URL encoding function for PHP...

// From http://snipplr.com/view/22246/base62-encode--decode/
private function base_encode($val, $base=62, $chars='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') {
    $str = '';
    do {
        $i = fmod($val, $base);
        $str = $chars[$i] . $str;
        $val = ($val - $i) / $base;
    } while($val > 0);
    return $str;
}
Simon East
  • 55,742
  • 17
  • 139
  • 133
1

Don't know if anyone will find this useful - it is more of a 'hack n slash' method, yet is simple and works nicely if you want only specific chars.

$dictionary = "abcdfghjklmnpqrstvwxyz23456789";
$dictionary = str_split($dictionary);

// Encode
$str_id = '';
$base = count($dictionary);

while($id > 0) {
    $rem = $id % $base;
    $id = ($id - $rem) / $base;
    $str_id .= $dictionary[$rem];
}


// Decode
$id_ar = str_split($str_id);
$id = 0;

for($i = count($id_ar); $i > 0; $i--) {
    $id += array_search($id_ar[$i-1], $dictionary) * pow($base, $i - 1);
} 
Ryan Charmley
  • 1,127
  • 15
  • 18
0

I have a variant of the problem, in that I store web pages from many different authors and need to prevent discovery of pages by guesswork. So my short URLs add a couple of extra digits to the Base-62 string for the page number. These extra digits are generated from information in the page record itself and they ensure that only 1 in 3844 URLs are valid (assuming 2-digit Base-62). You can see an outline description at http://mgscan.com/MBWL.

Graham
  • 107
  • 2
  • 8
0

Very good answer, I have created a Golang implementation of the bjf:

package bjf

import (
    "math"
    "strings"
    "strconv"
)

const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

func Encode(num string) string {
    n, _ := strconv.ParseUint(num, 10, 64)
    t := make([]byte, 0)

    /* Special case */
    if n == 0 {
        return string(alphabet[0])
    }

    /* Map */
    for n > 0 {
        r := n % uint64(len(alphabet))
        t = append(t, alphabet[r])
        n = n / uint64(len(alphabet))
    }

    /* Reverse */
    for i, j := 0, len(t) - 1; i < j; i, j = i + 1, j - 1 {
        t[i], t[j] = t[j], t[i]
    }

    return string(t)
}

func Decode(token string) int {
    r := int(0)
    p := float64(len(token)) - 1

    for i := 0; i < len(token); i++ {
        r += strings.Index(alphabet, string(token[i])) * int(math.Pow(float64(len(alphabet)), p))
        p--
    }

    return r
}

Hosted at github: https://github.com/xor-gate/go-bjf

Jerry Jacobs
  • 195
  • 9
0

My Python 3 version

base_list = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
base = len(base_list)

def encode(num: int):
    result = []
    if num == 0:
        result.append(base_list[0])

    while num > 0:
        result.append(base_list[num % base])
        num //= base

    print("".join(reversed(result)))

def decode(code: str):
    num = 0
    code_list = list(code)
    for index, code in enumerate(reversed(code_list)):
        num += base_list.index(code) * base ** index
    print(num)

if __name__ == '__main__':
    encode(341413134141)
    decode("60FoItT")
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
wyx
  • 3,334
  • 6
  • 24
  • 44
0

Here is a Node.js implementation that is likely to bit.ly. generate a highly random seven-character string.

It uses Node.js crypto to generate a highly random 25 charset rather than randomly selecting seven characters.

var crypto = require("crypto");
exports.shortURL = new function () {
    this.getShortURL = function () {
        var sURL = '',
            _rand = crypto.randomBytes(25).toString('hex'),
            _base = _rand.length;
        for (var i = 0; i < 7; i++)
            sURL += _rand.charAt(Math.floor(Math.random() * _rand.length));
        return sURL;
    };
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Hafiz Arslan
  • 453
  • 5
  • 16
0

For a quality Node.js / JavaScript solution, see the id-shortener module, which is thoroughly tested and has been used in production for months.

It provides an efficient id / URL shortener backed by pluggable storage defaulting to Redis, and you can even customize your short id character set and whether or not shortening is idempotent. This is an important distinction that not all URL shorteners take into account.

In relation to other answers here, this module implements the Marcel Jackwerth's excellent accepted answer above.

The core of the solution is provided by the following Redis Lua snippet:

local sequence = redis.call('incr', KEYS[1])

local chars = '0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz'
local remaining = sequence
local slug = ''

while (remaining > 0) do
  local d = (remaining % 60)
  local character = string.sub(chars, d + 1, d + 1)

  slug = character .. slug
  remaining = (remaining - d) / 60
end

redis.call('hset', KEYS[2], slug, ARGV[1])

return slug
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
fisch2
  • 2,574
  • 2
  • 26
  • 29
0

Implementation in Scala:

class Encoder(alphabet: String) extends (Long => String) {

  val Base = alphabet.size

  override def apply(number: Long) = {
    def encode(current: Long): List[Int] = {
      if (current == 0) Nil
      else (current % Base).toInt :: encode(current / Base)
    }
    encode(number).reverse
      .map(current => alphabet.charAt(current)).mkString
  }
}

class Decoder(alphabet: String) extends (String => Long) {

  val Base = alphabet.size

  override def apply(string: String) = {
    def decode(current: Long, encodedPart: String): Long = {
      if (encodedPart.size == 0) current
      else decode(current * Base + alphabet.indexOf(encodedPart.head),encodedPart.tail)
    }
    decode(0,string)
  }
}

Test example with Scala test:

import org.scalatest.{FlatSpec, Matchers}

class DecoderAndEncoderTest extends FlatSpec with Matchers {

  val Alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

  "A number with base 10" should "be correctly encoded into base 62 string" in {
    val encoder = new Encoder(Alphabet)
    encoder(127) should be ("cd")
    encoder(543513414) should be ("KWGPy")
  }

  "A base 62 string" should "be correctly decoded into a number with base 10" in {
    val decoder = new Decoder(Alphabet)
    decoder("cd") should be (127)
    decoder("KWGPy") should be (543513414)
  }

}
adrift
  • 637
  • 2
  • 10
  • 34
0

Function based in Xeoncross Class

function shortly($input){
$dictionary = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9'];
if($input===0)
    return $dictionary[0];
$base = count($dictionary);
if(is_numeric($input)){
    $result = [];
    while($input > 0){
        $result[] = $dictionary[($input % $base)];
        $input = floor($input / $base);
    }
    return join("", array_reverse($result));
}
$i = 0;
$input = str_split($input);
foreach($input as $char){
    $pos = array_search($char, $dictionary);
    $i = $i * $base + $pos;
}
return $i;
}
0

This is what I use:

# Generate a [0-9a-zA-Z] string
ALPHABET = map(str,range(0, 10)) + map(chr, range(97, 123) + range(65, 91))

def encode_id(id_number, alphabet=ALPHABET):
    """Convert an integer to a string."""
    if id_number == 0:
        return alphabet[0]

    alphabet_len = len(alphabet) # Cache

    result = ''
    while id_number > 0:
        id_number, mod = divmod(id_number, alphabet_len)
        result = alphabet[mod] + result

    return result

def decode_id(id_string, alphabet=ALPHABET):
    """Convert a string to an integer."""
    alphabet_len = len(alphabet) # Cache
    return sum([alphabet.index(char) * pow(alphabet_len, power) for power, char in enumerate(reversed(id_string))])

It's very fast and can take long integers.

Davide Muzzarelli
  • 929
  • 1
  • 9
  • 13
0

For a similar project, to get a new key, I make a wrapper function around a random string generator that calls the generator until I get a string that hasn't already been used in my hashtable. This method will slow down once your name space starts to get full, but as you have said, even with only 6 characters, you have plenty of namespace to work with.

Joel Berger
  • 20,180
  • 5
  • 49
  • 104
0

Why not just generate a random string and append it to the base URL? This is a very simplified version of doing this in C#.

static string chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
static string baseUrl = "https://google.com/";

private static string RandomString(int length)
{
    char[] s = new char[length];
    Random rnd = new Random();
    for (int x = 0; x < length; x++)
    {
        s[x] = chars[rnd.Next(chars.Length)];
    }
    Thread.Sleep(10);

    return new String(s);
}

Then just add the append the random string to the baseURL:

string tinyURL = baseUrl + RandomString(5);

Remember this is a very simplified version of doing this and it's possible the RandomString method could create duplicate strings. In production you would want to take in account for duplicate strings to ensure you will always have a unique URL. I have some code that takes account for duplicate strings by querying a database table I could share if anyone is interested.

Andrew Reese
  • 854
  • 1
  • 11
  • 27
0

This is my initial thoughts, and more thinking can be done, or some simulation can be made to see if it works well or any improvement is needed:

My answer is to remember the long URL in the database, and use the ID 0 to 9999999999999999 (or however large the number is needed).

But the ID 0 to 9999999999999999 can be an issue, because

  1. it can be shorter if we use hexadecimal, or even base62 or base64. (base64 just like YouTube using A-Z a-z 0-9 _ and -)
  2. if it increases from 0 to 9999999999999999 uniformly, then hackers can visit them in that order and know what URLs people are sending each other, so it can be a privacy issue

We can do this:

  1. have one server allocate 0 to 999 to one server, Server A, so now Server A has 1000 of such IDs. So if there are 20 or 200 servers constantly wanting new IDs, it doesn't have to keep asking for each new ID, but rather asking once for 1000 IDs
  2. for the ID 1, for example, reverse the bits. So 000...00000001 becomes 10000...000, so that when converted to base64, it will be non-uniformly increasing IDs each time.
  3. use XOR to flip the bits for the final IDs. For example, XOR with 0xD5AA96...2373 (like a secret key), and the some bits will be flipped. (whenever the secret key has the 1 bit on, it will flip the bit of the ID). This will make the IDs even harder to guess and appear more random

Following this scheme, the single server that allocates the IDs can form the IDs, and so can the 20 or 200 servers requesting the allocation of IDs. The allocating server has to use a lock / semaphore to prevent two requesting servers from getting the same batch (or if it is accepting one connection at a time, this already solves the problem). So we don't want the line (queue) to be too long for waiting to get an allocation. So that's why allocating 1000 or 10000 at a time can solve the issue.

nonopolarity
  • 146,324
  • 131
  • 460
  • 740