108

How would you convert an integer to base 62 (like hexadecimal, but with these digits: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ').

I have been trying to find a good Python library for it, but they all seems to be occupied with converting strings. The Python base64 module only accepts strings and turns a single digit into four characters. I was looking for something akin to what URL shorteners use.

martineau
  • 119,623
  • 25
  • 170
  • 301
mikl
  • 23,749
  • 20
  • 68
  • 89
  • Sounds like someone just found an open source project idea :) Let me know if you find anything or decide to create your own... – samoz Jul 13 '09 at 14:24
  • 1
    If you want to create short URLs, you might want to use the whole set of characters which don't need to be encoded: http://en.wikipedia.org/wiki/Percent-encoding#Types_of_URI_characters. That's 66 characters. – l0b0 Jul 13 '09 at 14:32
  • I think I'll pass on the dot and the tilde, just to avoid user confusion, but the dash and the underscores should be worthwhile additions, thanks. – mikl Jul 13 '09 at 14:45
  • what about Base64? You might have better luck finding libraries for that. – Mike Cooper Jul 14 '09 at 04:12
  • This question has a number of applicable answers: http://stackoverflow.com/questions/561486/how-to-convert-an-integer-to-the-shortest-url-safe-string-in-python/ – Miles Jul 14 '09 at 04:14
  • @Mike Cooper: Base 64 is not optimised for numbers, and thus not really applicable for this use case. @Miles: Yes, the baseconverter class that Simon Willison posted at http://www.djangosnippets.org/snippets/1431/ seems like a worthy competitor. I might do a benchmark some time to find the most efficient one :) – mikl Jul 14 '09 at 09:42
  • If you're attempting to write a URL shortener, https://code.google.com/p/python-mom/source/browse/mom/codec/base58.py provides a pretty good implementation of a base58 codec that works with both Python 2.5+ and Python 3.0. There is quite a few documentation in there explaining why base58 is appropriate. Hope this helps. – yesudeep Aug 09 '11 at 14:22
  • I made my js version into an open source project, check it out: https://github.com/sbussard/encode-the-things – Stephen Nov 19 '12 at 05:44
  • possible duplicate of [Is there a good python library that can turn numbers into their respective "symbols"?](http://stackoverflow.com/questions/4351467/is-there-a-good-python-library-that-can-turn-numbers-into-their-respective-symb) – André Laszlo Mar 13 '14 at 10:26
  • I have a Python library for doing exactly that here: [http://www.djangosnippets.org/snippets/1431/](http://www.djangosnippets.org/snippets/1431/) – Simon Willison Sep 28 '09 at 10:59
  • This might help: https://github.com/suminb/base62 – sbbs Jul 02 '19 at 18:47
  • Be careful with your definition of Base62. The internet has quite a few definitions and there is no apparent standard. Wikipedia shows A-Za-z0-9. Many answers here show 0-9a-zA-Z and I have seen 0-9A-Za-z. If you are passing the data on to people make sure you tell them your definition. – Richard B Jun 21 '21 at 10:15

23 Answers23

195

There is no standard module for this, but I have written my own functions to achieve that.

BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(num, alphabet):
    """Encode a positive number into Base X and return the string.

    Arguments:
    - `num`: The number to encode
    - `alphabet`: The alphabet to use for encoding
    """
    if num == 0:
        return alphabet[0]
    arr = []
    arr_append = arr.append  # Extract bound-method for faster access.
    _divmod = divmod  # Access to locals is faster.
    base = len(alphabet)
    while num:
        num, rem = _divmod(num, base)
        arr_append(alphabet[rem])
    arr.reverse()
    return ''.join(arr)

def decode(string, alphabet=BASE62):
    """Decode a Base X encoded string into the number

    Arguments:
    - `string`: The encoded string
    - `alphabet`: The alphabet to use for decoding
    """
    base = len(alphabet)
    strlen = len(string)
    num = 0

    idx = 0
    for char in string:
        power = (strlen - (idx + 1))
        num += alphabet.index(char) * (base ** power)
        idx += 1

    return num

Notice the fact that you can give it any alphabet to use for encoding and decoding. If you leave the alphabet argument out, you are going to get the 62 character alphabet defined on the first line of code, and hence encoding/decoding to/from 62 base.

PS - For URL shorteners, I have found that it's better to leave out a few confusing characters like 0Ol1oI etc. Thus I use this alphabet for my URL shortening needs - "23456789abcdefghijkmnpqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ"

starball
  • 20,030
  • 7
  • 43
  • 238
Baishampayan Ghose
  • 19,928
  • 10
  • 56
  • 60
  • 6
    +1: Nice! This can be extended with more URL-friendly characters to possibly save one character here and there. Characters I know are safe are: `$-_.+!*'(),;/?:@&=` You can probably use some other characters too like `[]~` etc. – Blixt Jul 13 '09 at 14:32
  • Oops, I think I'll change it to return 0 if num <= 0 :) – mikl Jul 13 '09 at 14:47
  • 25
    Naming bug: it's not base 62, since the alphabet is customizable. – unwind Sep 28 '09 at 14:24
  • 3
    For the decode, it's a better habit not to compute the powers (saves time, is shorter to write, but more importantly avoids off-by-one errors), thus: num=0; for char in string: num = num*base + alphabet.index(char) – ShreevatsaR Sep 28 '09 at 14:31
  • 1
    @ShreevatsaR: any particular reason for using str.index() instead of a dictionary lookup? See my answer ... – John Machin Oct 05 '09 at 23:47
  • 1
    I'm not sure about python, but you are likely going to be limited by the range of an integer. It takes 6 characters at base 62 to encode up to 2^31. `ceil( ln(2^31)/ln(62) ) == 6` When you go to decode more than 6 characters you will encounter an overflow. I haven't tried an implementation without that limitation, but I'm sure it's possible. – Jonathan Swinney Nov 11 '10 at 17:17
  • 2
    Jonathan - Python can handle numbers of arbitrary length - there is no overflow: >>> 256 * (62 ** 100) 44402652562862911414971048359760030835982580330786570771137804709455598239929932673552190201125730101070867075377228748911717860448985185350731601887476350502973424822800696272224256L – Anthony Briggs Feb 03 '11 at 05:24
  • @wuub: Good catch, however this can fairly easily be modified to handle negative numbers by prefixing a minus sign to the result when encoding and checking for a leading one in the decoder (since the alphabet doesn't include the `-` character). – martineau Jan 18 '12 at 20:02
  • 1
    The ascii value of upper-case letters is lower than ascii value of lower-case letters (i.e `'A' < 'a'`). If you want to keep alphabetical order between your original string and your encoded string, this alphabet is better : `"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"` – Caduchon Jan 04 '18 at 12:23
  • 1
    This function isn't reversible when your string is made up of zeros. `decode('0') == decode('00000')` – slang Jan 31 '19 at 06:59
  • `import string` and one can write `BASE62 = string.digits + string.ascii_letters` – Serge Stroobandt Oct 09 '19 at 20:23
  • 1
    @slang: Needn't be all zeros: `decode('042') == decode('0042') == decode('00042')` — which can lead to (pun!) very subtle and hard to debug problems. – martineau Jul 01 '20 at 12:45
  • Judging by your efficiency tricks related to `list.append` and `divmod`, makes little sense that you execute `list.reverse` afterwards judging by it's O(N) complexity when there are surely other ways to achieve the same result. Same goes for `str.join`. In any case, is there a reason that you don't set `Base62` as `encode`'s `alphabet` default parameter? – loko Jul 26 '21 at 20:49
61

I once wrote a script to do this aswell, I think it's quite elegant :)

import string
# Remove the `_@` below for base62, now it has 64 characters
BASE_LIST = string.digits + string.letters + '_@'
BASE_DICT = dict((c, i) for i, c in enumerate(BASE_LIST))

def base_decode(string, reverse_base=BASE_DICT):
    length = len(reverse_base)
    ret = 0
    for i, c in enumerate(string[::-1]):
        ret += (length ** i) * reverse_base[c]

    return ret

def base_encode(integer, base=BASE_LIST):
    if integer == 0:
        return base[0]

    length = len(base)
    ret = ''
    while integer != 0:
        ret = base[integer % length] + ret
        integer /= length

    return ret

Example usage:

for i in range(100):                                    
    print i, base_decode(base_encode(i)), base_encode(i)
Wolph
  • 78,177
  • 11
  • 137
  • 148
  • 9
    This version is considerably faster than the accepted solution from Baishampayan. I optimized further by computing length outside of the function. Testing results (100,000 iterations): version-WoLpH: .403 .399 .399 .398 .398 | version-Baishampayan: 1.783 1.785 1.782 1.788 1.784. This version is approximately 4x as fast. – Jordan Apr 28 '11 at 13:46
  • if use `reversed(string)` more faster than slicing `string[::-1]` in the base_decode function. – ENDOH takanao Jan 25 '14 at 03:18
  • 1
    Took me a long time to find this question. Never knew this was called base62 conversion. Nice answer. –  Feb 05 '16 at 09:49
  • 4
    I had to change `integer /= length` to `integer //=length` to get the correct remainder – karlgold Jul 27 '20 at 00:29
11

If you're looking for the highest efficiency (like django), you'll want something like the following. This code is a combination of efficient methods from Baishampayan Ghose and WoLpH and John Machin.

# Edit this list of characters as desired.
BASE_ALPH = tuple("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
BASE_DICT = dict((c, v) for v, c in enumerate(BASE_ALPH))
BASE_LEN = len(BASE_ALPH)

def base_decode(string):
    num = 0
    for char in string:
        num = num * BASE_LEN + BASE_DICT[char]
    return num

def base_encode(num):
    if not num:
        return BASE_ALPH[0]

    encoding = ""
    while num:
        num, rem = divmod(num, BASE_LEN)
        encoding = BASE_ALPH[rem] + encoding
    return encoding

You may want to also calculate your dictionary in advance. (Note: Encoding with a string shows more efficiency than with a list, even with very long numbers.)

>>> timeit.timeit("for i in xrange(1000000): base.base_decode(base.base_encode(i))", setup="import base", number=1)
2.3302059173583984

Encoded and decoded 1 million numbers in under 2.5 seconds. (2.2Ghz i7-2670QM)

Sepero
  • 4,489
  • 1
  • 28
  • 23
  • One does not necessarily need the `tuple()` around `BASE_ALPH` in the beginning. In Python every String is iterable. That feature is of course exploited by `enumerate()`. So the code gets even leaner :) – Luis Nell Apr 18 '13 at 16:05
  • 7
    Hey origiNell, you're right that the tuple() isn't needed, but on my system, it makes the code run about 20% faster. Try testing it out without the tuple() and see what works best for you. Cheers :) – Sepero Apr 24 '13 at 07:20
  • 1
    Interesting point. Makes total sense since tuples are more lightweight than strings. Thanks for the enlightenment :)! – Luis Nell Apr 25 '13 at 08:44
  • @Sepero I further improved your version in terms of formatting, naming, tests and functionality (negative numbers are supported): http://pastebin.com/4uket7iu (you may update your answer with this) – Joschua Aug 12 '14 at 15:15
  • @Joschua -- Your code at your URL didn't work for me. base_encode() seemed to only generate one encoded digit for the numbers I tested. – SMGreenfield May 26 '15 at 04:03
  • @Joschua -- I tried all sorts of values. I had copied your code exactly -- no modifications. This was on Python 2.6.1, running under Mac OS X. – SMGreenfield May 29 '15 at 23:25
  • @SMGreenfield Hmm, could have to do with your version. (FYI 2.6.1 was released on December 4th, 2008) Maybe try it with Python 2.7.10 or Python 3.4.3. – Joschua May 30 '15 at 09:07
  • @Joschua -- very well could be the problem. If it's working for you, then it must be at my end. Thanks. – SMGreenfield May 30 '15 at 18:15
  • Benchmark added as a new answer confirms this is the fastest. – dawid Feb 27 '21 at 20:49
10

The following decoder-maker works with any reasonable base, has a much tidier loop, and gives an explicit error message when it meets an invalid character.

def base_n_decoder(alphabet):
    """Return a decoder for a base-n encoded string
    Argument:
    - `alphabet`: The alphabet used for encoding
    """
    base = len(alphabet)
    char_value = dict(((c, v) for v, c in enumerate(alphabet)))
    def f(string):
        num = 0
        try:
            for char in string:
                num = num * base + char_value[char]
        except KeyError:
            raise ValueError('Unexpected character %r' % char)
        return num
    return f

if __name__ == "__main__":
    func = base_n_decoder('0123456789abcdef')
    for test in ('0', 'f', '2020', 'ffff', 'abqdef'):
        print test
        print func(test)
John Machin
  • 81,303
  • 11
  • 141
  • 189
  • While I would probably never use this, I had too give you a thumbs up for creativity. This code gave me a laugh. :) – Sepero Jan 10 '13 at 13:07
  • @Sepero: What's so funny? It's serious robust industrial-strength software. No Micky-Mouse reversing with a `**` operator in the loop. – John Machin Jan 15 '13 at 11:32
  • Calm yourself friend. You're right. I missed the true goodness of your inner loop due to it being buried within stuff that is unrelated to the question (wrapping, error checking, unit testing). – Sepero Jan 15 '13 at 18:26
  • 1
    Looks good, but haven't you forgotten an "industrial-strength" encoder which takes an integer plus alphabet to produce a string? – martineau Jan 17 '13 at 02:27
  • 1
    Was the q in the last value intentional to show off the ValueError being raised? – Thomas Vander Stichele Aug 14 '14 at 14:15
9

If you use django framework, you can use django.utils.baseconv module.

>>> from django.utils import baseconv
>>> baseconv.base62.encode(1234567890)
1LY7VK

In addition to base62, baseconv also defined base2/base16/base36/base56/base64.

Ryan Fau
  • 1,096
  • 9
  • 12
  • Good to know, it was moved recently from django utils to core signing : see https://code.djangoproject.com/ticket/32712?cversion=0&cnum_hist=1 and https://github.com/django/django/blob/main/django/core/signing.py – Laurent Lyaudet Aug 23 '22 at 08:16
4

If all you need is to generate a short ID (since you mention URL shorteners) rather than encode/decode something, this module might help:

https://github.com/stochastic-technologies/shortuuid/

Stavros Korokithakis
  • 4,680
  • 10
  • 35
  • 42
  • I am not sure that is appropriate for short URLs. A UUID is usually a very large number, so even base57 encoding it like he does is bound to be rather long for a short URL. – mikl Jan 11 '11 at 14:55
  • You can just cut as much as you want, collisions will still be unlikely as it's purely random, but won't be a unique id any more. – Stavros Korokithakis Jan 21 '11 at 23:09
3

You probably want base64, not base62. There's an URL-compatible version of it floating around, so the extra two filler characters shouldn't be a problem.

The process is fairly simple; consider that base64 represents 6 bits and a regular byte represents 8. Assign a value from 000000 to 111111 to each of the 64 characters chosen, and put the 4 values together to match a set of 3 base256 bytes. Repeat for each set of 3 bytes, padding at the end with your choice of padding character (0 is generally useful).

Williham Totland
  • 28,471
  • 6
  • 52
  • 68
  • 5
    The standard Python base64 encoding methods are not really suitable for short URLs, since it is optimized for encoding bytes (ie. strings/letters), and will produce longer outputs than just base-shifting the numerical value. – mikl Apr 02 '10 at 15:34
  • @mikl Of course, Python's base64 module may not be suitable for generating short URLs, but all of Python's encoding methods are really working on base-256 number sequences. bytes are really base-256 encoded "strings". Python 2.x treats strings as a sequence of bytes, whereas Python 3.x (which does the right thing) treats strings as Unicode. So b'foobar' is really only a fancy way of writing [102, 111, 111, 98, 97, 114] or [0x66,0x6f,0x6f,0x62,0x61,0x72] or b'\x66\x6f\x6f\x62\x61\x72' which unsurprisingly is the base-256 representation. Bytes are not strings or letters. Bytes are bytes. =) – yesudeep Aug 09 '11 at 14:19
  • @yesudeep: So, bytes are bytes…and what exactly is your point? – martineau Jan 17 '13 at 01:29
3

There is now a python library for this.

I'm working on making a pip package for this.

I recommend you use my bases.py https://github.com/kamijoutouma/bases.py which was inspired by bases.js

from bases import Bases
bases = Bases()

bases.toBase16(200)                // => 'c8'
bases.toBase(200, 16)              // => 'c8'
bases.toBase62(99999)              // => 'q0T'
bases.toBase(200, 62)              // => 'q0T'
bases.toAlphabet(300, 'aAbBcC')    // => 'Abba'

bases.fromBase16('c8')               // => 200
bases.fromBase('c8', 16)             // => 200
bases.fromBase62('q0T')              // => 99999
bases.fromBase('q0T', 62)            // => 99999
bases.fromAlphabet('Abba', 'aAbBcC') // => 300

refer to https://github.com/kamijoutouma/bases.py#known-basesalphabets for what bases are usable

Belldandu
  • 2,176
  • 1
  • 15
  • 16
2

you can download zbase62 module from pypi

eg

>>> import zbase62
>>> zbase62.b2a("abcd")
'1mZPsa'
ghostdog74
  • 327,991
  • 56
  • 259
  • 343
2

I hope the following snippet could help.

def num2sym(num, sym, join_symbol=''):
    if num == 0:
        return sym[0]
    if num < 0 or type(num) not in (int, long):
        raise ValueError('num must be positive integer')

    l = len(sym)  # target number base
    r = []
    div = num
    while div != 0: # base conversion
        div, mod = divmod(div, l)
        r.append(sym[mod])

    return join_symbol.join([x for x in reversed(r)])

Usage for your case:

number = 367891
alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
print num2sym(number, alphabet)  # will print '1xHJ'

Obviously, you can specify another alphabet, consisting of lesser or greater number of symbols, then it will convert your number to the lesser or greater number base. For example, providing '01' as an alphabet will output string representing input number as binary.

You may shuffle the alphabet initially to have your unique representation of the numbers. It can be helpful if you're making URL shortener service.

Vladimir Ignatev
  • 2,054
  • 1
  • 20
  • 34
  • 1
    Not bad. You might want to use `if num < 0 or type(num) not in (int, long):`. – martineau Jun 25 '13 at 15:28
  • That's better, but it's a little more complicated because `long` doesn't exist in Py 3.x -- so one might want to use [this answer](http://stackoverflow.com/a/3646519/355230). – martineau Jun 25 '13 at 18:25
  • 1
    _Or_ use my own portable version: `isinstance(x, (type(1), type(2**32)))`. – martineau Jun 25 '13 at 19:35
2

Python does not have a built-in solution. The chosen solution is probably the most readable one, but we might be able to scrap a bit of performance.

from string import digits, ascii_lowercase, ascii_uppercase

base_chars = digits + ascii_lowercase + ascii_uppercase


def base_it(number, base=62):
    def iterate(moving_number=number, moving_base=base):
        if not moving_number:
            return ''
        return iterate(moving_number // moving_base, moving_base * base) + base_chars[moving_number % base]

    return iterate() or base_chars[0]

Explanation

In any base every number is equal to a1 + a2*base**2 + a3*base**3... So the goal is to find all the as.

For every N=1,2,3... the code isolates the aN*base**N by "modulo" by base for base = base**(N+1) which slices all numbers bigger than N, and slicing all the numbers so that their serial is smaller than N by decreasing a every time the function is called recursively by the current aN*base**N.


Advantages and discussion

In this sample, there's only one multiplication (instead of a division) and some modulus operations, which are all relatively fast.

If you really want performance, though, you'd probably do better of using a CPython library.

Shu ba
  • 334
  • 1
  • 10
2

I have benefited greatly from others' posts here. I needed the python code originally for a Django project, but since then I have turned to node.js, so here's a javascript version of the code (the encoding part) that Baishampayan Ghose provided.

var ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

function base62_encode(n, alpha) {
  var num = n || 0;
  var alphabet = alpha || ALPHABET;

  if (num == 0) return alphabet[0];
  var arr = [];
  var base = alphabet.length;

  while(num) {
    rem = num % base;
    num = (num - rem)/base;
    arr.push(alphabet.substring(rem,rem+1));
  }

  return arr.reverse().join('');
}

console.log(base62_encode(2390687438976, "123456789ABCDEFGHIJKLMNPQRSTUVWXYZ"));
Stephen
  • 7,994
  • 9
  • 44
  • 73
  • I've updated this code and made it into an open source project for anyone who is interested https://github.com/sbussard/encode-the-things – Stephen Nov 19 '12 at 05:45
2

Simplest ever.

BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def encode_base62(num):
    s = ""
    while num>0:
      num,r = divmod(num,62)
      s = BASE62[r]+s
    return s


def decode_base62(num):
   x,s = 1,0
   for i in range(len(num)-1,-1,-1):
      s = int(BASE62.index(num[i])) *x + s
      x*=62
   return s

print(encode_base62(123))
print(decode_base62("1Z"))
melvil james
  • 592
  • 7
  • 18
1

Personally I like the solution from Baishampayan, mostly because of stripping the confusing characters.

For completeness, and solution with better performance, this post shows a way to use the Python base64 module.

Van Gale
  • 43,536
  • 9
  • 71
  • 81
  • 1
    As mentioned in my comment to Williham Totland, Pythons base64 is suboptimal for encoding numbers, since it is optimized for strings. – mikl Apr 02 '10 at 15:37
1
BASE_LIST = tuple("23456789ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz")
BASE_DICT = dict((c, v) for v, c in enumerate(BASE_LIST))
BASE_LEN = len(BASE_LIST)

def nice_decode(str):
    num = 0
    for char in str[::-1]:
        num = num * BASE_LEN + BASE_DICT[char]
    return num

def nice_encode(num):
    if not num:
        return BASE_LIST[0]

    encoding = ""
    while num:
        num, rem = divmod(num, BASE_LEN)
        encoding += BASE_LIST[rem]
    return encoding
paulkav1
  • 449
  • 3
  • 4
  • 1
    This fixes the name of BASE_LIST and also reverses the string on decoding which was omitted in Spero's otherwise excellent answer – paulkav1 Mar 29 '13 at 00:51
1

Here is an recurive and iterative way to do that. The iterative one is a little faster depending on the count of execution.

def base62_encode_r(dec):
    s = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    return s[dec] if dec < 62 else base62_encode_r(dec / 62) + s[dec % 62]
print base62_encode_r(2347878234)

def base62_encode_i(dec):
    s = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ret = ''
    while dec > 0:
        ret = s[dec % 62] + ret
        dec /= 62
    return ret
print base62_encode_i(2347878234)

def base62_decode_r(b62):
    s = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    if len(b62) == 1:
        return s.index(b62)
    x = base62_decode_r(b62[:-1]) * 62 + s.index(b62[-1:]) % 62
    return x
print base62_decode_r("2yTsnM")

def base62_decode_i(b62):
    s = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ret = 0
    for i in xrange(len(b62)-1,-1,-1):
        ret = ret + s.index(b62[i]) * (62**(len(b62)-i-1))
    return ret
print base62_decode_i("2yTsnM")

if __name__ == '__main__':
    import timeit
    print(timeit.timeit(stmt="base62_encode_r(2347878234)", setup="from __main__ import base62_encode_r", number=100000))
    print(timeit.timeit(stmt="base62_encode_i(2347878234)", setup="from __main__ import base62_encode_i", number=100000))
    print(timeit.timeit(stmt="base62_decode_r('2yTsnM')", setup="from __main__ import base62_decode_r", number=100000))
    print(timeit.timeit(stmt="base62_decode_i('2yTsnM')", setup="from __main__ import base62_decode_i", number=100000))

0.270266867033
0.260915645986
0.344734796766
0.311662500262
wenzul
  • 3,948
  • 2
  • 21
  • 33
  • I really liked your recursive approach. My daughter, who was taking AP Comp Sci, had figured out this same solution for me to implement a "base25" (using 'ABCDEFHJKMNPQRTUVWXY34789') in C++. I went to convert it to Python and being a total newb with that language hit a few stumbling blocks -- which you elegantly solved in a single line of code! You even avoid a common issue with 0 translating to an empty string in alphabets that don't begin with 0-9. Great work! (I don't need negative numbers, but your approach was so good it might be nice to add that for future browsers) – SMGreenfield May 26 '15 at 03:58
1

I wrote this a while back and it's worked pretty well (negatives and all included)

def code(number,base):
    try:
        int(number),int(base)
    except ValueError:
        raise ValueError('code(number,base): number and base must be in base10')
    else:
        number,base = int(number),int(base)
    if base < 2:
        base = 2
    if base > 62:
        base = 62
    numbers = [0,1,2,3,4,5,6,7,8,9,"a","b","c","d","e","f","g","h","i","j",
               "k","l","m","n","o","p","q","r","s","t","u","v","w","x","y",
               "z","A","B","C","D","E","F","G","H","I","J","K","L","M","N",
               "O","P","Q","R","S","T","U","V","W","X","Y","Z"]
    final = ""
    loc = 0
    if number < 0:
        final = "-"
        number = abs(number)
    while base**loc <= number:
        loc = loc + 1
    for x in range(loc-1,-1,-1):
        for y in range(base-1,-1,-1):
            if y*(base**x) <= number:
                final = "{}{}".format(final,numbers[y])
                number = number - y*(base**x)
                break
    return final

def decode(number,base):
    try:
        int(base)
    except ValueError:
        raise ValueError('decode(value,base): base must be in base10')
    else:
        base = int(base)
    number = str(number)
    if base < 2:
        base = 2
    if base > 62:
        base = 62
    numbers = ["0","1","2","3","4","5","6","7","8","9","a","b","c","d","e","f",
               "g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v",
               "w","x","y","z","A","B","C","D","E","F","G","H","I","J","K","L",
               "M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
    final = 0
    if number.startswith("-"):
        neg = True
        number = list(number)
        del(number[0])
        temp = number
        number = ""
        for x in temp:
            number = "{}{}".format(number,x)
    else:
        neg = False
    loc = len(number)-1
    number = str(number)
    for x in number:
        if numbers.index(x) > base:
            raise ValueError('{} is out of base{} range'.format(x,str(base)))
        final = final+(numbers.index(x)*(base**loc))
        loc = loc - 1
    if neg:
        return -final
    else:
        return final

sorry about the length of it all

martineau
  • 119,623
  • 25
  • 170
  • 301
Thropian
  • 11
  • 1
1

Python 3.7.x

I found a PhD's github for some algorithms when looking for an existing base62 script. It didn't work for the current max-version of Python 3 at this time so I went ahead and fixed where needed and did a little refactoring. I don't usually work with Python and have always used it ad-hoc so YMMV. All credit goes to Dr. Zhihua Lai. I just worked the kinks out for this version of Python.

file base62.py

#modified from Dr. Zhihua Lai's original on GitHub
from math import floor
base = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
b = 62;
def toBase10(b62: str) -> int:
    limit = len(b62)
    res = 0
    for i in range(limit):
        res = b * res + base.find(b62[i])
    return res
def toBase62(b10: int) -> str:
    if b <= 0 or b > 62:
        return 0
    r = b10 % b
    res = base[r];
    q = floor(b10 / b)
    while q:
        r = q % b
        q = floor(q / b)
        res = base[int(r)] + res
    return res

file try_base62.py

import base62
print("Base10 ==> Base62")
for i in range(999):
    print(f'{i} => {base62.toBase62(i)}')
base62_samples = ["gud", "GA", "mE", "lo", "lz", "OMFGWTFLMFAOENCODING"]
print("Base62 ==> Base10")
for i in range(len(base62_samples)):
    print(f'{base62_samples[i]} => {base62.toBase10(base62_samples[i])}')

output of try_base62.py

Base10 ==> Base62
0 => 0
[...]
998 => g6
Base62 ==> Base10
gud => 63377
GA => 2640
mE => 1404
lo => 1326
lz => 1337
OMFGWTFLMFAOENCODING => 577002768656147353068189971419611424

Since there was no licensing info in the repo I did submit a PR so the original author at least knows other people are using and modifying their code.

kayleeFrye_onDeck
  • 6,648
  • 5
  • 69
  • 80
1

In all solutions above they define the alphabet itself when in reality it's already available using the ASCII codes.

def converter_base62(count) -> str:
   result = ''
   start = ord('0')
   while count > 0:
      result = chr(count % 62 + start) + result
      count //= 62
   return result


def decode_base62(string_to_decode: str):
    result = 0
    start = ord('0')
    for char in string_to_decode:
        result = result * 62 + (ord(char)-start)
    return result

import tqdm

n = 10_000_000

for i in tqdm.tqdm(range(n)):
    assert decode_base62(converter_base62(i)) == i
0

Sorry, I can't help you with a library here. I would prefer using base64 and just adding to extra characters to your choice -- if possible!

Then you can use the base64 module.

If this is really, really not possible:

You can do it yourself this way (this is pseudo-code):

base62vals = []
myBase = 62
while num > 0:
   reminder = num % myBase
   num = num / myBase
   base62vals.insert(0, reminder)
Juergen
  • 12,378
  • 7
  • 39
  • 55
0

with simple recursion

"""
This module contains functions to transform a number to string and vice-versa
"""
BASE = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
LEN_BASE = len(BASE)


def encode(num):
    """
    This function encodes the given number into alpha numeric string
    """

    if num < LEN_BASE:
        return BASE[num]

    return BASE[num % LEN_BASE] + encode(num//LEN_BASE)


def decode_recursive(string, index):
    """
    recursive util function for decode
    """

    if not string or index >= len(string):
        return 0

    return (BASE.index(string[index]) * LEN_BASE ** index) + decode_recursive(string, index + 1)


def decode(string):
    """
    This function decodes given string to number
    """

    return decode_recursive(string, 0)

Lokesh Sanapalli
  • 1,012
  • 3
  • 18
  • 39
0

Benchmarking answers that worked for Python3 (machine: i7-8565U):

"""
us per enc()+dec()  #  test

(4.477935791015625, 2, '3Tx16Db2JPSS4ZdQ4dp6oW')
(6.073190927505493, 5, '3Tx16Db2JPSS4ZdQ4dp6oW')
(9.051250696182251, 9, '3Tx16Db2JPSS4ZdQ4dp6oW')
(9.864609956741333, 6, '3Tx16Db2JOOqeo6GCGscmW')
(10.868197917938232, 1, '3Tx16Db2JPSS4ZdQ4dp6oW')
(11.018349647521973, 10, '3Tx16Db2JPSS4ZdQ4dp6oW')
(12.448230504989624, 4, '03Tx16Db2JPSS4ZdQ4dp6oW')
(13.016672611236572, 7, '3Tx16Db2JPSS4ZdQ4dp6oW')
(13.212724447250366, 8, '3Tx16Db2JPSS4ZdQ4dp6oW')
(24.119479656219482, 3, '3tX16dB2jpss4zDq4DP6Ow')
"""

from time import time

half = 2 ** 127
results = []


def bench(n, enc, dec):
    start = time()
    for i in range(half, half + 1_000_000):
        dec(enc(i))
    end = time()
    results.append(tuple([end - start, n, enc(half + 1234134134134314)]))


BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode(num, alphabet=BASE62):
    """Encode a positive number into Base X and return the string.

    Arguments:
    - `num`: The number to encode
    - `alphabet`: The alphabet to use for encoding
    """
    if num == 0:
        return alphabet[0]
    arr = []
    arr_append = arr.append  # Extract bound-method for faster access.
    _divmod = divmod  # Access to locals is faster.
    base = len(alphabet)
    while num:
        num, rem = _divmod(num, base)
        arr_append(alphabet[rem])
    arr.reverse()
    return ''.join(arr)


def decode(string, alphabet=BASE62):
    """Decode a Base X encoded string into the number

    Arguments:
    - `string`: The encoded string
    - `alphabet`: The alphabet to use for decoding
    """
    base = len(alphabet)
    strlen = len(string)
    num = 0

    idx = 0
    for char in string:
        power = (strlen - (idx + 1))
        num += alphabet.index(char) * (base ** power)
        idx += 1

    return num


bench(1, encode, decode)
###########################################################################################################
# Remove the `_@` below for base62, now it has 64 characters
BASE_ALPH = tuple(BASE62)
BASE_LIST = BASE62
BASE_DICT = dict((c, v) for v, c in enumerate(BASE_ALPH))

###########################################################################################################
BASE_LEN = len(BASE_ALPH)


def decode(string):
    num = 0
    for char in string:
        num = num * BASE_LEN + BASE_DICT[char]
    return num


def encode(num):
    if not num:
        return BASE_ALPH[0]

    encoding = ""
    while num:
        num, rem = divmod(num, BASE_LEN)
        encoding = BASE_ALPH[rem] + encoding
    return encoding


bench(2, encode, decode)

###########################################################################################################
from django.utils import baseconv

bench(3, baseconv.base62.encode, baseconv.base62.decode)


###########################################################################################################
def encode(a):
    baseit = (lambda a=a, b=62: (not a) and '0' or
                                baseit(a - a % b, b * 62) + '0123456789abcdefghijklmnopqrstuvwxyz'
                                                            'ABCDEFGHIJKLMNOPQRSTUVWXYZ'[
                                    a % b % 61 or -1 * bool(a % b)])
    return baseit()


bench(4, encode, decode)


###########################################################################################################
def encode(num, sym=BASE62, join_symbol=''):
    if num == 0:
        return sym[0]

    l = len(sym)  # target number base
    r = []
    div = num
    while div != 0:  # base conversion
        div, mod = divmod(div, l)
        r.append(sym[mod])

    return join_symbol.join([x for x in reversed(r)])


bench(5, encode, decode)

###########################################################################################################
from math import floor

base = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
b = 62;


def decode(b62: str) -> int:
    limit = len(b62)
    res = 0
    for i in range(limit):
        res = b * res + base.find(b62[i])
    return res


def encode(b10: int) -> str:
    if b <= 0 or b > 62:
        return 0
    r = b10 % b
    res = base[r];
    q = floor(b10 / b)
    while q:
        r = q % b
        q = floor(q / b)
        res = base[int(r)] + res
    return res


bench(6, encode, decode)


###########################################################################################################
def encode(dec):
    s = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    return s[dec] if dec < 62 else encode(dec // 62) + s[int(dec % 62)]


def decode(b62):
    s = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    if len(b62) == 1:
        return s.index(b62)
    x = decode(b62[:-1]) * 62 + s.index(b62[-1:]) % 62
    return x


bench(7, encode, decode)


def encode(dec):
    s = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ret = ''
    while dec > 0:
        ret = s[dec % 62] + ret
        dec //= 62
    return ret


def decode(b62):
    s = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ret = 0
    for i in range(len(b62) - 1, -1, -1):
        ret = ret + s.index(b62[i]) * (62 ** (len(b62) - i - 1))
    return ret


bench(8, encode, decode)


###########################################################################################################

def encode(num):
    s = ""
    while num > 0:
        num, r = divmod(num, 62)
        s = BASE62[r] + s
    return s


def decode(num):
    x, s = 1, 0
    for i in range(len(num) - 1, -1, -1):
        s = int(BASE62.index(num[i])) * x + s
        x *= 62
    return s


bench(9, encode, decode)


###########################################################################################################

def encode(number: int, alphabet=BASE62, padding: int = 22) -> str:
    l = len(alphabet)
    res = []
    while number > 0:
        number, rem = divmod(number, l)
        res.append(alphabet[rem])
        if number == 0:
            break
    return "".join(res)[::-1]  # .rjust(padding, "0")


def decode(digits: str, lookup=BASE_DICT) -> int:
    res = 0
    last = len(digits) - 1
    base = len(lookup)
    for i, d in enumerate(digits):
        res += lookup[d] * pow(base, last - i)
    return res


bench(10, encode, decode)

###########################################################################################################

for row in sorted(results):
    print(row)
dawid
  • 663
  • 6
  • 12
0

Original javascript version:

var hash = "", alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", alphabetLength = 
alphabet.length;
do {
  hash = alphabet[input % alphabetLength] + hash;
  input = parseInt(input / alphabetLength, 10);
} while (input);

Source: https://hashids.org/

python:

def to_base62(number):
  alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
  alphabetLength = len(alphabet)
  result = ""
  while True:
    result = alphabet[number % alphabetLength] + result
    number = int(number / alphabetLength)
    if number == 0:
      break
  return result

print to_base62(59*(62**2) + 60*(62) + 61)
# result: XYZ
Curtis Yallop
  • 6,696
  • 3
  • 46
  • 36