34

I'm assigning all my MongoDB documents a GUID using uuid.uuid1(). I want a way I can derive an 11 character, unique, case-sensitive YouTube-like ID, such as

1_XmY09uRJ4 

from uuid's resulting hex string which looks like

ae0a0c98-f1e5-11e1-9t2b-1231381dac60

I want to be able to match the shortened ID to the hex and vice-versa, dynamically without having to store another string in the database. Does anyone have some sample code or can point me in the direction of the module or formula that can do this?

zakdances
  • what "t"? I'm not sure what you're referring to – zakdances Sep 04 '12 at 20:32
  • @yourfriendzak: your UUID contains a 't', making it invalid. – Martijn Pieters Sep 04 '12 at 20:34
  • hmmm that's odd...I copy and pasted it straight from a uuid.hex output... – zakdances Sep 04 '12 at 20:35
  • AFAIR, UUIDs have a time component. It's possible your string uses the 't' as a delimiter. – Carlos Jan 02 '14 at 16:44
  • @Carlos UUIDs are never displayed as hex with a time component. The `t` breaks the hex representation of the 9th byte (out of 16); it would normally have been a hex digit. My money is on `f` and the OP misread it as `t`, or they hit the `t` key at some point with that digit selected. At any rate, Python's `uuid.uuid1()` will only ever produce a `uuid.UUID()` instance, and its hex attribute outputs the 8-4-4-4-12 hex digit representation of the value and will **never** include a `t`. Here, the `t` is part of the 14 bits of the clock sequence value (bits 9-12, when counted from right to left). – Martijn Pieters Jul 14 '20 at 22:49

3 Answers

69

Convert the underlying bytes to a base64 value and strip the = padding.

You probably want to use the base64.urlsafe_b64encode() function to avoid using / and + (_ and - are used instead), so the resulting string can be used as a URL path element:

>>> import uuid, base64
>>> base64.urlsafe_b64encode(uuid.uuid1().bytes).rstrip(b'=').decode('ascii')
'81CMD_bOEeGbPwAjMtYnhg'

The reverse:

>>> uuid.UUID(bytes=base64.urlsafe_b64decode('81CMD_bOEeGbPwAjMtYnhg' + '=='))
UUID('f3508c0f-f6ce-11e1-9b3f-002332d62786')

To turn that into generic functions:

from base64 import urlsafe_b64decode, urlsafe_b64encode
from uuid import UUID

def uuid2slug(uuidstring):
    return urlsafe_b64encode(UUID(uuidstring).bytes).rstrip(b'=').decode('ascii')

def slug2uuid(slug):
    return str(UUID(bytes=urlsafe_b64decode(slug + '==')))

This gives you a way to represent the 16-byte UUID in a more compact form. Compress any further and you lose information, which means you cannot decompress it again to the full UUID. The full range of values that 16 bytes (128 bits) can represent will never fit in anything less than 22 base64 characters: base64 needs 4 characters for every 3 bytes of input, and every character encodes 6 bits of information.
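The 22-character floor can be checked with a little arithmetic. This is only a sketch; the names UUID_BITS, BITS_PER_CHAR and min_chars are made up for illustration, and uuid4() is used simply as a convenient source of 16 bytes.

```python
import base64
import math
import uuid

UUID_BITS = 16 * 8   # a UUID is 16 bytes, i.e. 128 bits
BITS_PER_CHAR = 6    # each base64 character carries 6 bits

# Smallest number of base64 characters that can hold 128 bits
min_chars = math.ceil(UUID_BITS / BITS_PER_CHAR)

# Cross-check against the real encoder with the '=' padding removed
slug = base64.urlsafe_b64encode(uuid.uuid4().bytes).rstrip(b'=').decode('ascii')

print(min_chars, len(slug))  # 22 22
```

Since 21 characters only hold 126 bits, an 11-character ID cannot carry a full UUID without dropping bits.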

YouTube's unique string is thus not based on a full 16-byte UUID. Their 11-character IDs are probably stored in the database for easy lookup and based on a smaller value; for example, if the value is also a URL-safe base64 string, then it encodes an 8-byte number.
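To illustrate the 8-byte case, here is a minimal sketch that maps an unsigned 64-bit integer to an 11-character URL-safe string and back. The helper names int2slug and slug2int are hypothetical, and nothing here is claimed to be how YouTube actually generates its IDs.

```python
import base64
import struct

def int2slug(number):
    """Encode an unsigned 64-bit integer as 11 URL-safe base64 characters."""
    packed = struct.pack('>Q', number)  # 8 bytes, big-endian
    return base64.urlsafe_b64encode(packed).rstrip(b'=').decode('ascii')

def slug2int(slug):
    """Reverse int2slug(); 8 bytes encode to 11 characters plus one '='."""
    packed = base64.urlsafe_b64decode(slug + '=')
    return struct.unpack('>Q', packed)[0]
```

Any value from 0 to 2**64 - 1 round-trips through these two functions, but a 128-bit UUID does not fit, so such an ID would have to be stored or derived from something smaller than the UUID.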

Martijn Pieters
2

For those looking specifically for a way to shorten UUIDs in a URL-safe way, the really useful answer from @MartijnPieters can be simplified somewhat by letting the base64 module handle the characters that are not URL-safe, similar to the comment on that answer from @okoboko (without a few unnecessary bits).

import base64
import uuid

# uuid to b64 string and back
uuid_to_b64str = base64.urlsafe_b64encode(uuid.uuid1().bytes).decode('utf8').rstrip('=\n')
b64str_to_uuid = uuid.UUID(bytes=base64.urlsafe_b64decode(f'{uuid_to_b64str}=='))

# uuid string to b64 string and back
uuidstr_to_b64str = base64.urlsafe_b64encode(uuid.UUID(str(uuid.uuid1())).bytes).decode('utf8').rstrip('=\n')
b64str_to_uuidstr = str(uuid.UUID(bytes=base64.urlsafe_b64decode(f'{uuidstr_to_b64str}==')))
benvc
  • Careful with `uuid1()` - if called multiple times at the same time on the same system (e.g. in a loop), it will return the same UUID. So you might want to consider using the randomly generated `uuid4()` function (or not, depending on the use case). – mozzbozz Oct 08 '21 at 21:19
1

You could look into Python's base64 module. A GUID is essentially a base-16 representation of a number, so you could trim out the hyphens, decode from base 16, and encode into base 64. Going in reverse requires decoding from base 64, encoding in base 16, and inserting the hyphens in the appropriate places.
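Those steps might look roughly like the following sketch, which works on the hex string directly instead of going through uuid.UUID. The function names hex2slug and slug2hex are invented for this example, and the URL-safe alphabet is an assumption carried over from the accepted answer.

```python
import base64
import binascii

def hex2slug(uuid_hex):
    """Trim the hyphens, decode from base 16, encode into URL-safe base 64."""
    raw = binascii.unhexlify(uuid_hex.replace('-', ''))
    return base64.urlsafe_b64encode(raw).rstrip(b'=').decode('ascii')

def slug2hex(slug):
    """Decode from base 64, encode in base 16, re-insert the hyphens (8-4-4-4-12)."""
    raw = base64.urlsafe_b64decode(slug + '==')
    digits = binascii.hexlify(raw).decode('ascii')
    return '-'.join((digits[:8], digits[8:12], digits[12:16],
                     digits[16:20], digits[20:]))
```

Note that hex2slug() raises binascii.Error for input that is not valid hex, such as the sample string in the question with its stray 't'.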

Platinum Azure