For a DynamoDB-backed web app, I need to generate unique, stable URLs that reliably refer to unique rows in the DynamoDB table.

In the past, for a PostgreSQL-backed application, I've had good results from using an auto-incrementing integer as the primary key and putting a hashid of that integer in the URL:

In [1]: import hashids

In [2]: hasher = hashids.Hashids(min_length=5, alphabet='abcdefghijklmnopqrstuvwxyz0123456789')

In [3]: hasher.encode(12345)
Out[3]: 'e763y'

I would then use this in the URL:

http://example.com/random-mutable-title-e763y/

However, with DynamoDB, there are no auto-incrementing primary keys, and UUIDs are recommended instead.

But UUIDs contain 128 bits, so the hashid of a UUID is much longer:

In [3]: import uuid

In [4]: hasher.encode(uuid.uuid4().int)
Out[4]: '5j257lmv00xwo5pvo132783jv0qkq'

It's prohibitively long for a URL, or at least plain ugly:

http://example.com/random-mutable-title-5j257lmv00xwo5pvo132783jv0qkq/

I've seen it recommended to simply mask the UUID:

In [5]: hasher.encode((uuid.uuid4().int & (1 << 64) - 1))
Out[5]: 'v0qnq92ml7oj382'

But even that seems a bit long:

http://example.com/random-mutable-title-v0qnq92ml7oj382/

I could saw off more bits:

In [6]: hasher.encode((uuid.uuid4().int & (1 << 32) - 1))
Out[6]: 'lj044pkn'

But that seems a bit dangerous:

In [7]: len(set(uuid.uuid4().int & (1 << 32) - 1 for _ in range(100000)))
Out[7]: 99999
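
For a rough sense of how dangerous, the birthday bound gives an estimate. This is only a back-of-the-envelope sketch: it assumes the bits that are kept are uniformly random, and the helper names `expected_collisions` and `collision_probability` are made up for illustration.

```python
import math


def expected_collisions(n: int, bits: int) -> float:
    """Approximate number of colliding pairs among n uniformly random IDs of `bits` bits."""
    return n * (n - 1) / 2 / 2 ** bits


def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound approximation: P(any collision) ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-expected_collisions(n, bits))


# Compare a few truncation widths for 100,000 generated IDs.
for bits in (32, 48, 64):
    print(bits, f"{collision_probability(100_000, bits):.3g}")
```

With 32 bits and 100,000 IDs this works out to roughly one expected colliding pair, which lines up with the single duplicate in the experiment above.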

What's the best/safest thing to do here? I don't anticipate a heavy write load to this table, so do I need to break down and implement an auto-incrementing integer scheme with conditional writes?
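
One shape such a scheme could take is an atomic counter kept in a dedicated item and bumped with `update_item`, rather than a conditional put. The following is a minimal sketch, not a tested design: it assumes `table` is a boto3 DynamoDB `Table` resource, and the key attribute names `pk`/`sk`, the counter item, and the attribute `current_value` are all hypothetical.

```python
def next_id(table, counter_name="url_id"):
    """Atomically increment a counter item and return the new value.

    `table` is assumed to be a boto3 DynamoDB Table resource; the key
    attribute names "pk"/"sk" and the counter item are hypothetical.
    """
    response = table.update_item(
        Key={"pk": "counter", "sk": counter_name},
        # ADD treats a missing numeric attribute as 0, then adds :inc.
        UpdateExpression="ADD #value :inc",
        ExpressionAttributeNames={"#value": "current_value"},
        ExpressionAttributeValues={":inc": 1},
        ReturnValues="UPDATED_NEW",
    )
    # boto3 returns DynamoDB numbers as decimal.Decimal, so convert to int.
    return int(response["Attributes"]["current_value"])
```

The returned integer could then be fed to `hasher.encode()` exactly as in the PostgreSQL case. Every new row costs one extra write against the counter item, which seems tolerable only because the write load is expected to be light.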

Update:

I just realized that if I right-shift a UUID1 by 32 bits, the result seems to be fairly unique:

In [8]: len(set(uuid.uuid1().int >> 32 for _ in range(1000000)))
Out[8]: 1000000

But will this come back to bite me? :D
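
For what it's worth, here is a quick sketch of which UUID1 fields survive that shift, based on the field attributes exposed by Python's `uuid` module. It only inspects the bits; it says nothing about whether the scheme is safe.

```python
import uuid

u = uuid.uuid1()
shifted = u.int >> 32

# The shift drops the low 32 bits (the tail of the 48-bit node field),
# so up to 96 bits remain: timestamp, clock sequence, and the top 16 node bits.
print(shifted.bit_length())                   # at most 96
print((shifted >> 64) == u.time_low)          # top 32 bits are still time_low
print((shifted & 0xFFFF) == (u.node >> 32))   # bottom 16 bits come from the node
```

So the shifted value can still be up to 96 bits long, and a hashid of it would likely remain much longer than the 32-bit variant above.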

Update 2:

To answer some questions from comments:

My application will be the only one writing to this table.

The application is written in Python.

The data schema for the table uses a hash key for the user ID and a sort key that varies depending on what's being stored in the row. Let's say I'm storing User records, a user's Projects, and Documents contained within the projects. I'll probably end up having a Global Secondary Index to support queries based on the URL hashid, unless the hashid and the record's primary key end up being equivalent.

Common queries for the table will be:

  1. User by email (for logins) supported by another GSI
  2. All Users (by hash key)
  3. All of a User's Projects (using hash key and sort key begins_with())
  4. A particular Project (supported by the GSI under discussion)
  5. All Documents in a particular Project (hash key and sort key begins_with())
  6. Individual document (supported by the GSI under discussion)
David Eyk
  • 1) Is your webapp the only application writing to the table? What language are you using? 2) Can you tell us anything about your data schema? Are you sure UUID is the right primary key here? I would suggest it's fairly rare that a UUID is a good primary key, as it basically means you are only accessing the table when you already have the key from somewhere else. – F_SO_K Apr 24 '18 at 09:38
  • I'm not sure that a UUID is the right primary key, which is why I'm asking this question. :) UUIDs *are* commonly used w/ DynamoDB though, because the DB client can safely generate a unique ID w/o coordination. So I would probably still use UUIDs for the hash and sort keys, and rely on GSI to support lookups from the URL slugs. – David Eyk Apr 24 '18 at 15:25
