63

Given that TimeUUID handily allows you to use now() in CQL, are there any reasons you wouldn't just go ahead and always use TimeUUID instead of plain old UUID?

giampaolo
Jay

3 Answers

74

UUID and TIMEUUID are stored the same way in Cassandra, and they only really represent two different sorting implementations.

TIMEUUID columns are sorted by their time components first, and then by their raw bytes, whereas UUID columns are sorted by their version first, then, if both are version 1, by their time component, and finally by their raw bytes. Curiously, the time component sorting implementations are duplicated between UUIDType and TimeUUIDType in the Cassandra code, except for different formatting.
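
The two orderings described above can be sketched as Python sort keys. This only illustrates the description in this answer; it is not Cassandra's comparator code. The helper names (`make_v1`, `timeuuid_sort_key`, `uuid_sort_key`) are made up for the example, and comparing the raw bytes as unsigned values is an assumption.

```python
import uuid


def make_v1(timestamp: int, clock_seq: int = 0, node: int = 0) -> uuid.UUID:
    """Build a version 1 UUID from a 60-bit timestamp (hypothetical helper)."""
    fields = (
        timestamp & 0xFFFFFFFF,      # time_low
        (timestamp >> 32) & 0xFFFF,  # time_mid
        (timestamp >> 48) & 0x0FFF,  # time_hi (version bits are set by version=1)
        (clock_seq >> 8) & 0x3F,     # clock_seq_hi (variant bits are set by version=1)
        clock_seq & 0xFF,            # clock_seq_low
        node,                        # 48-bit node
    )
    return uuid.UUID(fields=fields, version=1)


def timeuuid_sort_key(u: uuid.UUID):
    # TIMEUUID-style ordering: time component first, raw bytes as tie-breaker.
    return (u.time, u.bytes)


def uuid_sort_key(u: uuid.UUID):
    # UUID-style ordering: version first, time only for version 1, then raw bytes.
    time_part = u.time if u.version == 1 else 0
    return (u.version, time_part, u.bytes)


older = make_v1(0x0000FFFFFFFF)  # earlier in time, but time_low is all ones
newer = make_v1(0x000100000000)  # later in time, but time_low is zero
random_id = uuid.uuid4()

by_time = sorted([newer, older], key=timeuuid_sort_key)
by_bytes = sorted([newer, older], key=lambda u: u.bytes)
mixed = sorted([random_id, newer, older], key=uuid_sort_key)

print([str(u) for u in by_time])
print([str(u) for u in by_bytes])
print([u.version for u in mixed])
```

The `older`/`newer` pair is chosen so that time order and raw byte order disagree: the low 32 bits of the timestamp are stored first in a version 1 UUID, which is why a time-aware comparison is needed at all.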

I think of the UUID vs. TIMEUUID question primarily as documentation: if you choose TIMEUUID you're saying that you're storing things in chronological order, and that these things can occur at the same time, so a simple timestamp isn't enough. Using UUID says that you don't care about order (even if in practice the columns will be ordered by time if you put version 1 UUIDs in them), you just want to make sure that things have unique IDs.

Even if using NOW() to generate UUID values is convenient, it's also very surprising to other people reading your code.

It probably does not matter much in the grand scheme of things, but sorting non-version 1 UUIDs is a bit faster than version 1, so if you have a UUID column and generate the UUIDs yourself, go for another version.

Theo
  • 1
    How would sorting non-Version 1 UUIDs be faster? For example, Version 4 UUIDs are completely random which I expect would provide the worst sorting performance. I do agree that the issue should be immaterial. If you are using UUIDs, you do so for any of several good reasons but performance is not amongst them. Fortunately, today's computers can handle the space and sorting demands made by UUIDs. – Basil Bourque Jul 31 '13 at 03:38
  • 3
    The content of the UUIDs is not relevant to the performance of the sorting algorithm. Non-version 1 sorts faster _in Cassandra_ because no unpacking of the bytes into a timestamp happens. It's a very, very small difference, I just thought it was interesting. – Theo Jul 31 '13 at 13:31
  • Is the now() function the only way to generate a timeuuid? Is it possible to generate custom ones? It's only for testing that I need custom ones. – Charlie Parker Apr 03 '14 at 23:18
  • Good question Pinocchio. Maybe not the answer, but I do know there are minTimeuuid() and maxTimeuuid(). Example: insertion_time < minTimeuuid('2015-04-04 22:05+0000') AND insertion_time > maxTimeuuid('2015-04-03 22:05+0000'); – Melroy van den Berg Apr 03 '15 at 23:56
  • @Theo when you say "you just want to make sure that things have unique ID", do you mean that a timeuuid might not be unique? For example, can I store users with a timeuuid as the partition key? – Ced Oct 27 '16 at 03:00
31

A TimeUUID is a plain old UUID according to the documentation.

A UUID is simply a 128-bit value. Think of it as an unimaginably large number.

The particular bits may be determined by any of several methods. The original method involved taking the MAC address of the computer's networking hardware and combining it with the current date and time, plus an arbitrary number and a random number. Squish all that together to get a virtually unique number.

Later, for various reasons (security, privacy), other methods were invented to assemble the bits when generating a UUID value. These other methods omit date-time and/or MAC address as an ingredient. The point being: Not all UUID values have an embedded date-time value.
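
As a small illustration of that point, Python's standard `uuid` module exposes the version number and, for version 1 values, the embedded 60-bit timestamp (100-nanosecond ticks since 1582-10-15). The sketch below is not tied to Cassandra; the helper name `embedded_datetime` is made up for the example, and the sample value is one of the timeuuids quoted elsewhere on this page.

```python
import uuid
from datetime import datetime, timedelta, timezone
from typing import Optional

# Version 1 timestamps count 100 ns ticks from the start of the Gregorian calendar.
UUID_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def embedded_datetime(u: uuid.UUID) -> Optional[datetime]:
    """Return the embedded UTC date-time, or None if the UUID is not version 1."""
    if u.version != 1:
        return None
    return UUID_EPOCH + timedelta(microseconds=u.time // 10)


time_based = uuid.UUID("49cbda60-961b-11e8-9854-134d5b3f9cf8")
random_based = uuid.uuid4()

print(time_based.version, embedded_datetime(time_based))
print(random_based.version, embedded_datetime(random_based))
```

For the version 4 value the helper returns `None`, since its bits were never derived from a clock.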

The Cassandra doc incorrectly refers to its TimeUUID being a "Type 1 UUID". The correct term is Version 1 UUID. This version is sometimes called the "time-based version".


A Bit Of Advice

Cassandra seems to identify this specific version of UUID for the purpose of extracting the date and time portion of the 128 bits. Extracting the date-time from a UUID is a bad idea.

For one thing, UUID was never intended to be used for such history tracking. Indeed, the spec for UUID specifically recognizes that (a) computer clocks can be reset and therefore (b) UUIDs generated later may actually record an earlier date-time than previous UUIDs. Another reason not to extract date-time from a UUID is that you may well have UUIDs that were not generated by the time method, in which case you will be building a date-time value based on bits that do not in fact represent the date-time of creation. A third reason is that when programming code is later refactored, the UUID may be generated at a different time than the database record, so using the UUID's date-time would be misleading.

If you need to track date-time history, do so explicitly. Create a date-time field in your data. By the way, track that date-time in UTC, but that’s another topic.

Basil Bourque
  • 3
    For the record, Cassandra doc explicitly advised to use ntp to synchronize system time across all nodes. http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/install/installRecommendSettings.html – John Jul 31 '13 at 14:08
  • 17
    Agreed on using UTC. But to address your other concerns: 1) Timestamps also suffer from clock drift, so they are no better than TimeUUID in this regard for time series data. 2) In the context of CQL3 and a Cassandra schema using a TimeUUID datatype, it's reasonable to expect that all of the UUIDs in those columns are time-encoded, type 1 UUIDs. 3) In CQL3 you can either use NOW() or a specific datetime to create TimeUUIDs on insert. So processing old data can still result in historically correct TimeUUIDs in a Cassandra table. – platforms Sep 18 '13 at 15:24
  • 2
    @platforms Conflating two different purposes into a single value is a bad idea in principle, a bad practice. In this case, 1. date-time history tracking and 2. primary key identifier. When the day comes that you want to export or import data with other systems/sources/sinks, you'll have regrets. As further proof of the confusion created needlessly, while **gaining nothing in return**, re-read the Question of this page! – Basil Bourque Sep 23 '14 at 19:02
  • 3
    No regrets so far, on a system in production for over a year, including data exports and imports. But I understand your principled argument and agree with the kind of separation-of-concerns notion that likely informs your opinion. In practice, for the purposes of indexing time series data on Cassandra, I find the use of TimeUUIDs extremely useful. But, in principle, would I choose any form of UUID as the best way to store a time value? No. – platforms Sep 23 '14 at 21:36
  • 3
    The TimeUUID type in Cassandra is metadata (like any Cassandra type) that tells Cassandra how to interpret the data (e.g. getting the date, or creating a UUID based on a date or now). The gain of using it is to prevent data duplication if you need to access a row directly AND list the rows sorted by date. It only makes sense as part of a composite key. If you have 2 fields (date and unique id), you will have a table with the date first and the id after (in the composite key) to do the sorting, and a second with the id first and the date after (for direct access). – Kazaag Jun 02 '15 at 12:27
  • Agreed that for most cases recording a UTC date is best, but for time-series data it may be best in some cases to pre-partition the data in a designated time zone. – KingOfHypocrites Jun 03 '15 at 17:00
2

All said, you need to generate some to believe them. Timeuuids are Version 1 UUIDs, and only the first 8 characters seem to be randomized, as you can see below, so there is some chance of conflict; still, a timeuuid is better than using the timestamp itself. If UUID randomness is important, a Version 4 UUID is a better choice, since a collision is extremely improbable.

So, it feels like if you don't care about uniqueness across partitions, and your partitions are wide-row time series data with high write rates that need some unique identifier for each event (time), it's a good choice that also has the benefit of clustering, pagination, etc.

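-- Column names (id, ts) are assumed; the original statements omitted them.
insert into test_tuuid (id, ts) values (1, now());
insert into test_tuuid (id, ts) values (1, now());
insert into test_tuuid (id, ts) values (1, now());
insert into test_tuuid (id, ts) values (1, now());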

49cbda60-961b-11e8-9854-134d5b3f9cf8
49d1a6c1-961b-11e8-9854-134d5b3f9cf8
49d59e61-961b-11e8-9854-134d5b3f9cf8
49d8d2b1-961b-11e8-9854-134d5b3f9cf8
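
One way to check which parts of these values actually differ is to split them into their fields. The sketch below uses Python's standard `uuid` module on the four values above; it is an inspection aid only, not a description of how Cassandra or its drivers generate timeuuids.

```python
import uuid

samples = [
    uuid.UUID("49cbda60-961b-11e8-9854-134d5b3f9cf8"),
    uuid.UUID("49d1a6c1-961b-11e8-9854-134d5b3f9cf8"),
    uuid.UUID("49d59e61-961b-11e8-9854-134d5b3f9cf8"),
    uuid.UUID("49d8d2b1-961b-11e8-9854-134d5b3f9cf8"),
]

# Print each value broken into its version 1 fields.
for u in samples:
    print(
        f"v{u.version} time_low={u.time_low:08x} time_mid={u.time_mid:04x} "
        f"time_hi={u.time_hi_version:04x} clock_seq={u.clock_seq:04x} "
        f"node={u.node:012x}"
    )

# Gaps between consecutive embedded timestamps, in 100 ns ticks.
gaps = [later.time - earlier.time for earlier, later in zip(samples, samples[1:])]
print(gaps)
```

In these samples only the `time_low` field (the first 8 hex characters) changes, and the gaps work out to roughly 38 ms, 26 ms and 21 ms, which fits the time between the four inserts; the clock sequence and node parts are identical in all four values.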
Alexis Wilke
kisna
  • 1
    Actually the first 8 characters are not just random. The CQL driver goes through some additional steps to ensure that there are no collisions when generating new TIMEUUID values. https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/timeuuid_functions_r.html – Ian Goldby Mar 10 '21 at 12:32