
I've been browsing the net trying to find a solution that will allow us to generate unique IDs in a regionally distributed environment.

I looked at the following options (among others):

SNOWFLAKE (by Twitter)

  • It seems like a great solution, but I just don't like the added complexity of having to manage another piece of software just to create IDs;
  • It lacks documentation at this stage, so I don't think it will be a good investment;
  • The nodes need to be able to communicate with one another using Zookeeper (what about latency / communication failures?)

UUID

  • Just look at it: 550e8400-e29b-41d4-a716-446655440000;
  • It's a 128-bit ID;
  • There have been some known collisions (depending on the version, I guess); see this post.

AUTOINCREMENT IN RELATIONAL DATABASE LIKE MYSQL

  • This seems safe, but unfortunately, we are not using relational databases (scalability preferences);
  • We could deploy a MySQL server for this like what Flickr does, but again, this introduces another point of failure / bottleneck. Also added complexity.

AUTOINCREMENT IN A NON-RELATIONAL DATABASE LIKE COUCHBASE

  • This could work since we are using Couchbase as our database server, but;
  • This will not work when we have more than one cluster in different regions (latency issues, network failures): at some point, IDs will collide depending on the amount of traffic;

MY PROPOSED SOLUTION (this is what I need help with)

Let's say that we have clusters consisting of 10 Couchbase nodes and 10 Application nodes in 5 different regions (Africa, Europe, Asia, America and Oceania). This is to ensure that content is served from a location closest to the user (to boost speed) and to ensure redundancy in case of disasters, etc.

Now, the task is to generate IDs that won't collide when the replication (and balancing) occurs, and I think this can be achieved in 3 steps:

Step 1

All regions will be assigned integer IDs (unique identifiers):

  • 1 - Africa;
  • 2 - America;
  • 3 - Asia;
  • 4 - Europe;
  • 5 - Oceania.

Step 2

Assign an ID to every Application node that is added to the cluster, keeping in mind that there may be up to 99 999 servers in one cluster (even though I doubt it: just a safety precaution). This will look something like this (fake IPs):

  • 00001 - 192.187.22.14
  • 00002 - 164.254.58.22
  • 00003 - 142.77.22.45
  • and so forth.

Please note that all of these are in the same cluster, so that means you can have a node 00001 in each region.
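To illustrate Steps 1 and 2 together (just a sketch; the array shape and where this configuration lives are assumptions, not part of the design above):

$regionCodes = array(
    'Africa'  => 1,
    'America' => 2,
    'Asia'    => 3,
    'Europe'  => 4,
    'Oceania' => 5,
);

// Node registry for one cluster (fake IPs from the example above).
// Node IDs only need to be unique within their own region/cluster.
$nodeIds = array(
    '00001' => '192.187.22.14',
    '00002' => '164.254.58.22',
    '00003' => '142.77.22.45',
);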

Step 3

For every record inserted into the database, an incremented ID will be used to identify it, and this is how it will work:

Couchbase offers an increment feature that we can use to create IDs internally within the cluster. To ensure redundancy, 3 replicas will be created within the cluster. Since these are in the same place, I think it should be safe to assume that unless the whole cluster is down, one of the nodes responsible for this will be available; otherwise the number of replicas can be increased.
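As a rough sketch of Step 3 using the PHP Couchbase client (the host, bucket and counter key are placeholders, and the exact increment() signature may differ between SDK versions, so treat this as an assumption rather than exact API usage):

// Connect to the local cluster (placeholder connection details).
$cb = new Couchbase('127.0.0.1:8091', '', '', 'default');

// Atomically increment the cluster-wide counter, creating it if it doesn't exist.
// Assumed signature: increment($key, $offset, $create, $expiry, $initial).
$sequence = $cb->increment('id_counter', 1, true, 0, 0);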

Bringing it all together

Say a user is signing up from Europe: The application node serving the request will grab the region code (4 in this case), get its own ID (say 00005) and then get an incremented ID (1) from Couchbase (from the same cluster).

We end up with 3 components: 4, 00005, 1. Now, to create an ID from this, we can just join these components into 4.00005.1. To make it even better (I'm not too sure about this), we can concatenate (not add them up) the components to end up with: 4000051.

In code, this will look something like this:

$id = '4'.'00005'.'1';

NB: Not $id = 4+00005+1;.
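Wrapped into a small helper, the whole thing could look like this (a sketch only; make_id() and the padding width are my own naming, based on the steps above):

// Build an ID from the region code, the node ID and the cluster counter value.
function make_id($regionCode, $nodeId, $sequence)
{
    // Zero-pad the node ID to 5 digits so that, for example, node 1 and node 11
    // cannot produce overlapping strings when concatenated.
    $node = str_pad($nodeId, 5, '0', STR_PAD_LEFT);
    return $regionCode . $node . $sequence;
}

$id = make_id(4, 5, 1); // "4000051"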

Pros

  • IDs look better than UUIDs;
  • They seem unique enough. Even if a node in another region generates the same incremented ID and has the same node ID as the one above, we always have the region code to set them apart;
  • They can still be stored as integers (probably Big Unsigned integers);
  • It's all part of the architecture, no added complexities.

Cons

  • No sorting (or is there)?
  • This is where I need your input (most)

I know that every solution has flaws, and possibly more than what we see on the surface. Can you spot any issues with this whole approach?

Thank you in advance for your help :-)

EDIT

As @DaveRandom suggested, we can add the 4th step:

Step 4

We can just generate a random number and append it to the ID to prevent predictability. Effectively, you end up with something like this:

4000051357 instead of just 4000051.
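In code, assuming a 3-digit random suffix (the range is an assumption; any extra entropy would do):

// Append a random component purely to make the IDs harder to guess.
$id = '4' . '00005' . '1' . mt_rand(100, 999); // e.g. "4000051357"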

  • Are people able to contact others through their IDs? I only ask because a predictable syntax would let spammers iterate over all of your customers. If not, though, that isn't a problem – Mattsjo Aug 15 '13 at 08:14
  • I'm not sure I fully understand your question, but these IDs will be used solely to identify records, whether it's posts, profiles, messages, etc. Not as usernames or anything like that. – Sthe Aug 15 '13 at 08:17
  • That's fine then. Just ignore me :D – Mattsjo Aug 15 '13 at 08:19
  • And is it possible to retrieve a user somehow by this `ID`? I think spammers will be happy to grab the entire cluster's user data. – Alma Do Aug 15 '13 at 08:19
  • These predictability issues raised by the previous two comments can be resolved by simply adding more entropy. You have the workings of a system for generating unique IDs; you could easily start adding other bits of data (a timestamp, even just random bytes) and the IDs will continue to be unique as long as the components that make them unique keep a rigid structure. They will merely become less predictable. – DaveRandom Aug 15 '13 at 08:22
  • @AlmaDoMundo, Being able to pull out a user publicly by knowing the ID wasn't part of the plan, unless one could bypass permission checks on the API. But I think DaveRandom has come to the rescue regarding this one. I'll add it as a 4th step. – Sthe Aug 15 '13 at 08:26
  • Ok, like @Mattsjo - ignore me then :) – Alma Do Aug 15 '13 at 08:29
  • About UUID... The only bad thing you wrote about it is "There has been some known collisions". Why don't you just append/prepend your region code and incremented ID from Couchbase? – Glavić Aug 15 '13 at 14:06
  • @Glavić, I see your point. My other issue with UUID is it not being an integer (big or small), and not being a number for that matter. Almost all user IDs on the net are integers (just an observation), so when developers want to use our API, they would have to add `VARCHAR` field types to their databases for these IDs. Most relational databases (if not all) are known to index integers better than strings. I'm trying to ensure ease of use of our services. – Sthe Aug 15 '13 at 15:48
  • Thank you all for your contribution. Because there is no right or wrong answer on this subject (only a matter of taste), I'll accept scalabl3's answer since it has more points that are already contributing to the proposed solution. Warm regards from South Africa :-) – Sthe Aug 16 '13 at 23:15

2 Answers


You are concerned about IDs for two reasons:

  1. Potential for collisions in a complex network infrastructure
  2. Appearance

Starting with the second issue, Appearance. While a UUID certainly isn't a great beauty when it comes to an identifier, there are diminishing returns as you introduce a truly unique number across a complex data center (or data centers) as you mention. I'm not convinced that there is a dramatic change in perception of an application when a long number versus a UUID is used for example in a URL to a web application. Ideally, neither would be shown, and the ID would only ever be sent via Ajax requests, etc. While a nice clean memorable URL is preferable, it's never stopped me from shopping at Amazon (where they have absolutely hideous URLs). :)

Even with your proposal, the identifiers, while shorter in character count than a UUID, are no more memorable than a UUID. So, the appearance would likely remain debatable.

Talking about the first point: yes, there are a few cases where UUIDs have been known to generate conflicts. While that shouldn't happen in a properly configured and consistently maintained architecture, I can see how it might happen (but I'm personally a lot less concerned about it).

So, if you're talking about alternatives, I've become a fan of the simplicity of the MongoDB ObjectId and its techniques for avoiding duplication when generating an ID. The full documentation is here. The quick relevant pieces are similar to your potential design in several ways:

ObjectId is a 12-byte BSON type, constructed using:

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 3-byte machine identifier,
  • a 2-byte process id, and
  • a 3-byte counter, starting with a random value.

The timestamp can often be useful for sorting. The machine identifier is similar to your application server having a unique ID. The process id is just additional entropy, and finally, to prevent conflicts, there is a counter that is auto-incremented whenever the timestamp is the same as the last time an ObjectId was generated (so that ObjectIds can be created rapidly). ObjectIds can be generated on the client or on the database. Further, ObjectIds do take up fewer bytes than a UUID (but only 4 fewer). Of course, you could skip the timestamp and drop 4 bytes.
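To make the layout concrete, here is a rough PHP sketch of building an ObjectId-style 12-byte value from those four parts (this is only an illustration of the structure, not MongoDB's actual driver code; the counter here is a placeholder value):

// 4-byte timestamp + 3-byte machine id + 2-byte process id + 3-byte counter.
$counter = 1; // in practice: an atomically incremented counter seeded with a random value
$oid = pack('N', time())                          // seconds since the Unix epoch
     . substr(md5(gethostname(), true), 0, 3)     // machine identifier
     . pack('n', getmypid() & 0xFFFF)             // process id
     . substr(pack('N', $counter), 1);            // low 3 bytes of the counter
echo bin2hex($oid); // 24 hex characters, 12 bytes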

For clarification, I'm not suggesting you use MongoDB, but be inspired by the technique they use for ID generation.

So, I think your solution is decent (and maybe you want to be inspired by MongoDB's implementation of a unique ID) and doable. As to whether you need to do it, I think that's a question only you can answer.

  • Thank you for the answer, but: `Firstly`: We are already using Couchbase and we are very far along with development (not to mention that we are happy with it). So adding MongoDB to the picture (if that's what you mean) is no different from adding Snowflake (or something similar: more complexity). `Secondly`: When you say `...in a properly configured...`, this involves syncing clocks etc. (correct me if I'm wrong), which is one of the things I'm trying to avoid. I know they have to be in sync, but I don't want it to be something that can cause a severe malfunction. Or have I misunderstood you? – Sthe Aug 15 '13 at 15:57
  • Sorry -- I wasn't suggesting you use MongoDB at all -- just using their ObjectID as inspiration and comparison to your idea. I've added a clarification to the answer. – WiredPrairie Aug 15 '13 at 16:45
  • And only v1 and v2 of UUID relied on timestamps. http://en.wikipedia.org/wiki/Universally_unique_identifier – WiredPrairie Aug 15 '13 at 16:53

I think this looks pretty solid. Each region maintains consistency, and if you use XDCR there are no collisions. INCR is atomic within a cluster, so you will have no issues there. You don't actually need the machine code part of it: if all the app servers within a region are connected to the same cluster, there is no need to include the 00001 part. If that is useful to you for other reasons (some sort of analytics) then by all means keep it, but it isn't necessary.

So it can simply be '4' . '1' (using your example).

Can you give me an example of what kind of "sorting" you need?

First: one downside of adding entropy (and I am not sure why you would need it) is that you cannot iterate over the ID collection as easily.

For example: if your IDs run from 1-100, which you will know from a simple GET query on the counter key, you could assign tasks by group: this task takes 1-10, the next takes 11-20 and so on, and workers can execute in parallel. If you add entropy, you will need to use a Map/Reduce view to pull the collections down, so you lose the benefit of the key-value pattern.
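For instance, a dispatcher along these lines (a sketch; the counter key, chunk size and dispatch_task() are made-up names) only needs the counter value, not a view:

// Read the current highest ID straight from the counter key and hand out ranges.
$cb = new Couchbase('127.0.0.1:8091', '', '', 'default'); // placeholder connection
$max = (int) $cb->get('id_counter'); // e.g. 100
$chunk = 10;
for ($start = 1; $start <= $max; $start += $chunk) {
    $end = min($start + $chunk - 1, $max);
    // dispatch_task($start, $end); // each worker fetches keys $start..$end directly
}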

Second: Since you are concerned with readability, it can be valuable to add a document/object type identifier as well, and this can be used in Map/Reduce Views (or you can use a json key to identify that).

Ex: 'u:' . '4' . '1'

If you are referring to IDs externally, you might want to obscure them in other ways. If you need an example, let me know and I can append my answer with something you could do.

@scalabl3

  • Thanks for the contribution. `First`: I think I've already lost sorting capabilities (across all regions, out of the box), so I'm no longer worried about it. We've had quite a number of issues with Couchbase Views (Map/Reduce) (data consistency, not being able to search using fields in the document, etc.), so we have replaced them with ElasticSearch (working great so far). We actually use it for queries just like in your example above. Good point about the machine code; I now just need to think about whether I really need it. – Sthe Aug 15 '13 at 16:13
  • Map/Reduce is to provide indexes, so you are indexing single fields (most of the time) and then querying for single or range results (again, most of the time). ES is there for that full text/keyword search type queries... – scalabl3 Aug 16 '13 at 16:46
  • To prevent getting off the subject, let us end the Map/Reduce => ES part of the debate here :-). But we will only be using Map/Reduce for some statistics and aggregation, not indexing. We have already encountered a lot of data consistency issues with it (especially since you can't update the view directly; you would often have to call it with `stale : false`, which can be quite resource intensive). Regarding the object identifiers, you make a great point. We are already using this, but internally to namespace documents, for example: `user:40000523` etc. Thank you for such a valuable contribution. – Sthe Aug 16 '13 at 23:11
  • No problem! My pleasure! Just to be clear and correct, you didn't have data consistency issues with your M/R Views; you had "index" consistency issues. Your data is your documents, and Couchbase has a strong consistency model. There are strategies to reduce the time lag between CRUD and index updates, and you can tweak settings. Generally there are more issues at low CRUD volume than at higher volumes because of what triggers the indexers: 5000 updates, or 5 seconds, or stale=false; more info: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation-autoupdate.html – scalabl3 Aug 19 '13 at 04:19
  • Hhmmm. So, that means the reason we were having those issues is that a low number of CRUD operations caused Couchbase to only re-index once the time threshold was reached? That makes a lot of sense now. Thank you very much :-) – Sthe Aug 19 '13 at 07:38
  • Technically it's not "re-indexing" it's just "indexing" -- in other words it's indexing the next batch of data that has been written to disk since the last indexing batch. It's incremental, append-only b+-tree so it only "starts from scratch" when you redefine the definition of the index (onDesignDocumentPublish). :) The "triggers" are the time elapsed since, the number of updates, or using stale=false, which is what is configurable... :) – scalabl3 Aug 21 '13 at 15:11