
I need to migrate a table from Cassandra to PostgreSQL.

What I need to migrate: The table has a TimeUUID column that stores time as a UUID. This column also served as the clustering key. Time was stored as a UUID to avoid collisions when rows are inserted in the same millisecond. The column was also used in WHERE clauses, typically timeUUID BETWEEN 'foo' AND 'bar', and it produced correct results.

Where I need to migrate it to: I'm moving to Postgres, so I need to find a suitable alternative to this. PostgreSQL has a UUID data type, but from what I've read and tried so far, it stores it as 4-byte int but treats a UUID similar to a String when used in a WHERE clause with relational operators.

select * from table where timeUUID > 'foo' will have xyz in the result.

According to my understanding, it is not necessary for a UUID, or even a TimeUUID, to be always increasing. Because of this, Postgres produces the wrong result when compared to Cassandra on the same dataset.

What I've considered so far: I considered storing it as BIGINT, but that will be susceptible to collisions at millisecond time resolution. I could go for a resolution of micro/nano seconds, but I'm afraid that would exhaust BIGINT's range.

Storing the UUID as CHAR will prevent collisions, but then I'll lose the ability to apply relational operators to the column.

TIMESTAMP fits the best but I'm worried about timezone and collisions.

What I exactly need (tl;dr):

  1. Some way to have higher time resolution, or a way to avoid collisions (unique value generation).

  2. The column should support relational operators, i.e. uuid_col < 'uuid_for_some_timestamp'.

PS: This is a Java application.

MePsyDuck

2 Answers


tl;dr

Stop thinking in Cassandra terms. Cassandra's designers made some flawed design decisions.

➥ Do not mix the two concerns: tracking time and identifying rows.

Mixing the two is the flaw in Cassandra.

Cassandra abuses UUID

Unfortunately, Cassandra abuses UUIDs. Your predicament shows the unfortunate foolishness of their approach.

The purpose of a UUID is strictly to generate an identifier without needing to coordinate with a central authority as would be needed for other approaches such as a sequence number.

Cassandra uses Version 1 UUIDs, which take the current moment plus an arbitrary small number and combine them with the MAC address of the issuing computer. All this data makes up most of the 128 bits in a UUID.

Cassandra makes the terrible design decision to extract that moment in time for use in time-tracking, violating the intent of the UUID design. UUIDs were never intended to be used for time tracking.

There are several alternative Versions in the UUID standard. These alternatives do not necessarily contain a moment in time. For example, Version 4 UUIDs instead use random numbers generated from a cryptographically-strong generator.
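To make that concrete, here is a small sketch using only the standard java.util.UUID class (the class name is mine): its built-in generator produces exactly these Version 4 random UUIDs, and it refuses to hand back a timestamp for them, because there is none inside.

```java
import java.util.UUID;

public class UuidVersionDemo {  // illustrative class name
    public static void main(String[] args) {
        // java.util.UUID.randomUUID() yields a Version 4 (random) UUID.
        UUID v4 = UUID.randomUUID();
        System.out.println(v4 + " is version " + v4.version());

        // A Version 4 UUID embeds no moment in time; asking for one throws.
        try {
            v4.timestamp();
        } catch (UnsupportedOperationException e) {
            System.out.println("no timestamp inside a v4 UUID");
        }
    }
}
```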

If you want to generate Version 1 UUIDs, install the uuid-ossp plugin (“extension”) (wrapping the OSSP uuid library) usually bundled with Postgres. That plugin offers several functions you can call to generate UUID values.

[Postgres] stores it as 4-byte int

Postgres defines UUID as a native data type. So how such values get stored is really none of our business, and could change in future versions of Postgres (or in its new pluggable storage methods). You pass in a UUID, and you get back a UUID; that is all we know as users of Postgres. As a bonus, it is good to learn that Postgres (in its current “heap” storage method) stores UUID values efficiently as 128 bits, and not inefficiently as, for example, the text of the hex string canonically used to display a UUID to humans.

Note that Postgres has built-in support for storing UUID values, not generating UUID values. To generate values:

  • Some folks use the pgcrypto extension, if already installed in their database. That plugin can only generate Version 4 nearly-all-random UUIDs.
  • I suggest you instead use the uuid-ossp extension. This gives you a variety of Versions of UUID to choose from.

To learn more, see: Generating a UUID in Postgres for Insert statement?

As for your migration, I suggest “telling the truth” as a generally good approach. A date-time value should be stored in a date-type column with an appropriately labeled name. An identifier should be stored in a primary key column of an appropriate type (often integer types, or UUID) with an appropriately labeled name.

So stop playing the silly clever games that Cassandra plays.

Extract the date-time value, store it in a date-time column. Postgres has excellent date-time support. Specifically, you’ll want to store the value in a column of the SQL-standard type TIMESTAMP WITH TIME ZONE. This data type represents a moment, a specific point on the timeline.

The equivalent type in Java for representing a moment would be Instant or OffsetDateTime or ZonedDateTime. The JDBC 4.2 spec requires support only for the second, inexplicably, not the first or third. Search Stack Overflow for more of this Java and JDBC info as it has been covered many many times already.

Continue to use UUID but only as the designated primary key column of your new table in Postgres. You can tell Postgres to auto-generate these values.

Storing UUID as CHAR

No, do not store UUID as text.

TIMESTAMP fits the best but I'm worried about timezone and collisions.

There is a world of difference between TIMESTAMP WITH TIME ZONE and TIMESTAMP WITHOUT TIME ZONE. So never say just TIMESTAMP.

Postgres always stores a TIMESTAMP WITH TIME ZONE in UTC. Any time zone or offset information included with a submitted value is used to adjust to UTC, and then discarded. Java retrieves values of this type as UTC. So no problem.

The problem comes when using other tools that have the well-intentioned but tragically-flawed feature of dynamically applying a default time zone while generating text to display the value of the field. The value retrieved from Postgres is always in UTC, but its presentation may have been adjusted to another offset or zone. Either avoid such tools or be sure to set the default zone to UTC itself. All programmers, DBAs, and sysadmins should learn to work and think in UTC while on the job.

TIMESTAMP WITHOUT TIME ZONE is entirely different. This type lacks the context of a time zone or offset-from-UTC. So this type cannot represent a moment. It holds a date and a time-of-day but that's all. And that is ambiguous of course. If the value is noon on the 23rd of January this year, we do not know if you mean noon in Tokyo, noon in Tehran, or noon in Toledo — all very different moments, several hours apart. The equivalent type in Java is LocalDateTime. Search Stack Overflow to learn much more.
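To see the ambiguity concretely, here is a java.time sketch (the class name is mine; Toledo, Ohio shares the America/New_York zone): the very same LocalDateTime resolves to moments many hours apart depending on which zone you attach.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.Month;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class NoonAmbiguity {  // illustrative class name
    public static void main(String[] args) {
        // Noon on the 23rd of January, with no zone: a date with time-of-day, not a moment.
        LocalDateTime noon = LocalDateTime.of(2019, Month.JANUARY, 23, 12, 0);

        // Attach two different zones to that same wall-clock time.
        ZonedDateTime tokyo  = noon.atZone(ZoneId.of("Asia/Tokyo"));
        ZonedDateTime toledo = noon.atZone(ZoneId.of("America/New_York")); // Toledo, Ohio

        System.out.println(tokyo.toInstant());   // 2019-01-23T03:00:00Z
        System.out.println(toledo.toInstant());  // 2019-01-23T17:00:00Z

        // Two moments, fourteen hours apart, from one "noon".
        System.out.println(Duration.between(tokyo.toInstant(), toledo.toInstant()).toHours()); // 14
    }
}
```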

Table of date-time types in Java (both legacy and modern) and in standard SQL.

Time was stored as UUID to avoid collisions when rows are inserted in the same millisecond.

Version 1 UUIDs track time with a resolution as fine as 100 nanoseconds (1/10th of a microsecond), if the host computer's hardware clock can do so. The java.time classes represent time with nanosecond resolution, and capture the current moment with microsecond resolution (as of Java 9 and later). Postgres stores moments with a resolution of microseconds. So with Java & Postgres, you'll be close to Cassandra in this regard.

Storing the current moment.

// Capture the current moment in UTC, then pass it to JDBC (requires a JDBC 4.2 driver).
OffsetDateTime odt = OffsetDateTime.now( ZoneOffset.UTC ) ;
myPreparedStatement.setObject( … , odt ) ;

Retrieval.

// JDBC 4.2 maps TIMESTAMP WITH TIME ZONE to OffsetDateTime.
OffsetDateTime odt = myResultSet.getObject( … , OffsetDateTime.class ) ;
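One wrinkle worth knowing: because Postgres keeps only microseconds while java.time can carry nanoseconds, you may want to truncate before writing, so the stored value compares equal to the in-memory one after a round trip. A sketch (the class name is mine; ChronoUnit.MICROS is standard java.time):

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

public class MicroTruncation {  // illustrative class name
    public static void main(String[] args) {
        // Lop off any sub-microsecond digits so the value survives
        // Postgres' microsecond-resolution storage unchanged.
        OffsetDateTime odt = OffsetDateTime.now(ZoneOffset.UTC)
                                           .truncatedTo(ChronoUnit.MICROS);
        System.out.println(odt.getNano() % 1_000 == 0);  // true: nothing finer than a microsecond
    }
}
```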

I can go for resolution of micro/nano seconds

No you cannot. Conventional computer clocks today cannot precisely track time in nanoseconds.
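Java itself makes this distinction: System.nanoTime() reports in nanosecond units, but its documentation promises only nanosecond precision of representation, not nanosecond accuracy. And the wall clock behind System.currentTimeMillis() can be stepped backwards at any time, while nanoTime() is merely a monotonic counter, useless as a timestamp. A sketch (the class name is mine):

```java
public class ClockKinds {  // illustrative class name
    public static void main(String[] args) {
        // System.currentTimeMillis() reads the wall clock, which NTP or an
        // administrator may step backwards at any moment.
        System.out.println("wall clock millis: " + System.currentTimeMillis());

        // System.nanoTime() reads a monotonic counter: it never decreases
        // within one JVM, but it is not a timestamp and its nanosecond
        // units do not imply nanosecond accuracy.
        long t1 = System.nanoTime();
        long t2 = System.nanoTime();
        System.out.println("monotonic, non-decreasing: " + (t2 >= t1));  // always true
    }
}
```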

And using time-tracking solely as an identifier value is a flawed idea.

it is not necessary for UUID or even TimeUUID to be always increasing

You can never count on a clock always increasing. Clocks get adjusted and reset. Computer hardware clocks are not that accurate. Not understanding the limitations of computer clocks is one of the naïve and unreasonable aspects of Cassandra’s design.

And this is why a Version 1 UUID uses an arbitrary small number (called the clock sequence) along with the current moment, because the current moment could repeat when a clock gets reset/adjusted. A responsible UUID implementation is expected to notice the clock falling back, and then increment that small number to compensate and avoid duplicates. Per RFC 4122 section 4.1.5:

For UUID version 1, the clock sequence is used to help avoid duplicates that could arise when the clock is set backwards in time or if the node ID changes.

If the clock is set backwards, or might have been set backwards (e.g., while the system was powered off), and the UUID generator can not be sure that no UUIDs were generated with timestamps larger than the value to which the clock was set, then the clock sequence has to be changed. If the previous value of the clock sequence is known, it can just be incremented; otherwise it should be set to a random or high-quality pseudo-random value.

There is nothing in the UUID specifications that promises to be “always increasing”. Circling back to my opening statement, Cassandra abuses UUIDs.
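You can inspect these pieces from Java: java.util.UUID exposes timestamp() and clockSequence() for Version 1 values. A sketch (the class name and sample UUID literal are mine; per RFC 4122, the timestamp counts 100-nanosecond ticks from the Gregorian reform date, 1582-10-15):

```java
import java.time.Instant;
import java.util.UUID;

public class V1Anatomy {  // illustrative class name
    // The UUID timestamp field counts 100-ns ticks from the Gregorian reform date.
    static final Instant GREGORIAN_EPOCH = Instant.parse("1582-10-15T00:00:00Z");

    public static void main(String[] args) {
        // A sample Version 1 UUID; the '1' starting the third group marks the version.
        UUID v1 = UUID.fromString("2c5ea4c0-4067-11e9-8bad-9b1deb4d3b7d");
        System.out.println("version:        " + v1.version());        // 1
        System.out.println("clock sequence: " + v1.clockSequence());  // the arbitrary small number

        // Convert 100-ns ticks to a moment, splitting seconds from nanos
        // to avoid overflowing a long.
        long ticks = v1.timestamp();
        Instant moment = GREGORIAN_EPOCH.plusSeconds(ticks / 10_000_000L)
                                        .plusNanos((ticks % 10_000_000L) * 100L);
        System.out.println("embedded moment: " + moment);  // a moment in early 2019
    }
}
```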

Basil Bourque
  • I guess I should have been more clear. In the new PostgreSQL table, I do have a separate `ID` primary key column. My only needs were to have a way to store time precisely (albeit what I asked in the question implied I want the column to be unique), and to be able to apply relational operators on it. Anyhow, I've decided to use TIMESTAMP (with or without tz, I'm yet to decide). – MePsyDuck Aug 28 '19 at 04:25
  • 1
    @MePsyDuck Postgres has a [rich assortment of date-time functions](https://www.postgresql.org/docs/current/functions-datetime.html) for you to use in making queries. In some cases, you may find it easier to work on the Java side, using the industry-leading *java.time* framework. See [Tutorial by Oracle](https://docs.oracle.com/javase/tutorial/datetime/TOC.html). And I added some more text explaining that you must use `TIMESTAMP WITH TIME ZONE` when tracking a specific moment in time, not the `WITHOUT` type. You will find much more info by searching Stack Overflow. – Basil Bourque Aug 28 '19 at 05:00
  • 1
    @MePsyDuck, regarding timestamps, *store everything in UTC*. As Basil mentioned, Postgres is pretty capable at time calculations. Check out AT TIME ZONE to get you started on the most obvious annoyance of keeping everything in UTC. – Morris de Oryx Aug 28 '19 at 07:09

It sounds like a Cassandra TimeUUID is a version 1 UUID, while Postgres generates a version 4 UUID. You can generate V1 in Postgres too:

https://www.postgresql.org/docs/11/uuid-ossp.html

I use pgcrypto for UUIDs, but it only generates V4.

Others can say more authoritatively, but as I remember, UUIDs are 128-bit/16-byte types in Postgres that don't readily cast to numbers. You can cast them to text or even a binary string:

SELECT DECODE(REPLACE(id::text, '-',''), 'hex') from foo;

I can't imagine that's a super fast or good idea...
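Since this is a Java application, the same fact holds on the Java side: a UUID is just two longs, 16 bytes. If you ever need the raw bytes, say to compare against that hex decode, something like this works (the class and helper names are mine):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidRawBytes {  // illustrative class name
    // Serialize a UUID to its 16 raw bytes, most significant byte first.
    static byte[] toBytes(UUID u) {
        return ByteBuffer.allocate(16)
                         .putLong(u.getMostSignificantBits())
                         .putLong(u.getLeastSignificantBits())
                         .array();
    }

    public static void main(String[] args) {
        byte[] raw = toBytes(UUID.fromString("2c5ea4c0-4067-11e9-8bad-9b1deb4d3b7d"));
        System.out.println(raw.length);                      // 16
        System.out.printf("first byte: 0x%02x%n", raw[0]);   // 0x2c, matching the text form
    }
}
```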

From what you say, your issue is around sorting by the timestamp element. Ancoron Luciferis has been working on this question, I believe. You can find some of his test results here:

https://github.com/ancoron/pg-uuid-test

Within Postgres, the serial "types" are the standard feature used for unique sequence numbers. So, BIGSERIAL instead of BIGINT, in what you were saying. The timestamp columns are great (also 8 bytes), but not so suitable for a unique ID.

In our setup, we're using V4 UUIDs for synthetic keys, and timestamptz fields for timestamps. So, we've got two columns instead of one. (Postgres is a centralized collector for a lot of different data sources here, which is why we use UUIDs instead of serial counters, BTW.)

Personally, I like having timestamps that are timestamps, as they're easier to work with, reason about, and search on at different levels of granularity. Plus! You may get to take advantage of Postgres' amazing BRIN index type:

https://www.postgresql.fastware.com/blog/brin-indexes-what-are-they-and-how-do-you-use-them

Morris de Oryx
  • 1
    The `SERIAL` types in Postgres have been supplanted by identity columns. See: https://www.2ndquadrant.com/en/blog/postgresql-10-identity-columns/ – Basil Bourque Aug 28 '19 at 02:21
  • Thanks for the answer, but my concern is about storing time with high resolution and to be able to apply relation operators. I can generate UUID in the application without the need of any plugin. I also should've mentioned that I cannot use plugins due to some reasons. – MePsyDuck Aug 28 '19 at 04:29
  • Basil! Thanks for the tip on identity columns and the excellent-looking link, I'll check it out. – Morris de Oryx Aug 28 '19 at 07:06
  • @MePsyDuck - I don't know Cassandra, and am not clear what you mean by "apply relation operators." In Postgres, you've got some powerful options. For example, generate_series() to get a synthetic record set with intervals as you like, say every 2 hours. Then LEFT JOIN your data against that. Except that instead of your rows' (highly granular) DTS values, use DATE_TRUNC(), or EXTRACT, to change the DTS to something less granular, like hours, etc. Works great. – Morris de Oryx Aug 28 '19 at 07:11
  • It was a typo. My use case is rather simple. I wanted to be able to compare two UUIDs with `>` and `<` operators. In Cassandra it works, as it compares the timestamp behind the UUID, but Postgres compares the string value of the UUID. Anyways, I'm going with @Basil's advice and using timestamp with tz. – MePsyDuck Aug 28 '19 at 09:33