For a DynamoDB-backed web app, I need to generate unique, stable URLs that each reliably refer to a single row in the DynamoDB table.
In the past, for a PostgreSQL-backed application, I had good results using an auto-incrementing integer as the primary key and exposing a hashid of that integer:
In [1]: import hashids
In [2]: hasher = hashids.Hashids(min_length=5, alphabet='abcdefghijklmnopqrstuvwxyz0123456789')
In [3]: hasher.encode(12345)
Out[3]: 'e763y'
I would then use this in the URL:
http://example.com/random-mutable-title-e763y/
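When handling a request, the hashid at the end of the slug can be decoded back into the integer key. A minimal round-trip sketch using the hashids library with the same parameters as above (splitting the hashid off the rest of the slug is left out):

```python
import hashids

hasher = hashids.Hashids(min_length=5,
                         alphabet='abcdefghijklmnopqrstuvwxyz0123456789')

# Encode the integer primary key into a short slug component...
slug = hasher.encode(12345)

# ...and decode it back when routing a request. decode() returns a
# tuple of the integers that were originally encoded.
assert hasher.decode(slug) == (12345,)
```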
However, DynamoDB has no auto-incrementing primary keys, and UUIDs are recommended instead.
A UUID is 128 bits, though, and the hashid of a UUID is much longer:
In [3]: import uuid
In [4]: hasher.encode(uuid.uuid4().int)
Out[4]: '5j257lmv00xwo5pvo132783jv0qkq'
It's prohibitively long for a URL, or at least plain ugly:
http://example.com/random-mutable-title-5j257lmv00xwo5pvo132783jv0qkq/
I've seen it recommended to simply mask the UUID:
In [5]: hasher.encode((uuid.uuid4().int & (1 << 64) - 1))
Out[5]: 'v0qnq92ml7oj382'
But even that seems a bit long:
http://example.com/random-mutable-title-v0qnq92ml7oj382/
I could saw off more bits:
In [6]: hasher.encode((uuid.uuid4().int & (1 << 32) - 1))
Out[6]: 'lj044pkn'
But that seems a bit dangerous:
In [7]: len(set(uuid.uuid4().int & (1 << 32) - 1 for _ in range(100000)))
Out[7]: 99999
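For what it's worth, that single collision in 100,000 draws is about what the birthday approximation predicts for a 32-bit space (a back-of-the-envelope sketch, with a 64-bit space for comparison):

```python
def expected_collisions(n, bits):
    """Approximate expected number of colliding pairs when drawing
    n values uniformly at random from a space of 2**bits values
    (the birthday-problem approximation n*(n-1) / (2 * 2**bits))."""
    return n * (n - 1) / (2 * 2**bits)

# Drawing 100,000 random 32-bit ids, roughly one collision is
# expected, matching the 99999 distinct values observed above.
print(expected_collisions(100_000, 32))   # ~1.16
# With 64 bits, a collision at this scale is vanishingly unlikely.
print(expected_collisions(100_000, 64))   # ~2.7e-10
```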
What's the best/safest thing to do here? I don't anticipate a heavy write load to this table, so do I need to break down and implement an auto-incrementing integer scheme with conditional writes?
Update:
I just realized that if I right-shift a UUID1 by 32 bits, the result still seems to be fairly unique:
In [8]: len(set(uuid.uuid1().int >> 32 for _ in range(1000000)))
Out[8]: 1000000
But will this come back to bite me? :D
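For context, a version-1 UUID is timestamp-based, and right-shifting by 32 only discards part of the node field, which explains the uniqueness seen above. A sketch of the bit layout (an illustration, not an endorsement of the scheme):

```python
import uuid

u = uuid.uuid1()
shifted = u.int >> 32

# Version-1 UUID layout, from most to least significant bits of u.int:
#   time_low (32) | time_mid (16) | time_hi_and_version (16)
#   | clock_seq (14, preceded by 2 variant bits) | node (48)
# Shifting right by 32 discards only the low 32 bits of the node, so
# the full 60-bit timestamp and the clock sequence survive.
assert shifted >> 64 == u.time_low
assert (shifted >> 16) & 0x3FFF == u.clock_seq
```

One consequence worth noting: the surviving value is essentially timestamp plus per-machine clock sequence, so it also exposes the row's creation time in the URL, and its uniqueness guarantees only hold per machine.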
Update 2:
To answer some questions from comments:
My application will be the only one writing to this table.
The application is written in Python.
The data schema for the table uses a hash key for the user ID and a sort key that varies depending on what's being stored in the row. Let's say I'm storing User records, a user's Projects, and Documents contained within the projects. I'll probably end up having a Global Secondary Index to support queries based on the URL hashid, unless the hashid and the record's primary key end up being equivalent.
Common queries for the table will be:
- User by email (for logins), supported by another GSI
- All Users (by hash key)
- All of a User's Projects (using the hash key and a sort key `begins_with()` condition)
- A particular Project (supported by the GSI under discussion)
- All Documents in a particular Project (hash key and sort key `begins_with()`)
- An individual Document (supported by the GSI under discussion)