33

Our current PostgreSQL database is using GUID's as primary keys and storing them as a Text field.

My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.

I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.

It would be a significant project to change these, and I'm wondering if it would be worth it?

Scottie
  • 11,050
  • 19
  • 68
  • 109
  • Why not build two databases, with simulated production data, one with each primary key strategy, and benchmark? – Kevin Nov 20 '15 at 22:04
  • 1
    @Kevin: I could do that, but I was hoping someone on SO would have the answer so I don't have to... – Scottie Nov 20 '15 at 22:05
  • We don't have your app in front of us. You know which queries you are executing a lot better than we do. – Kevin Nov 20 '15 at 22:06
  • What is the scale of the tables? Are we talking tables with millions of rows? Tens of millions? Billions? – khampson Nov 20 '15 at 22:27
  • It's all of the tables, Some could grow to tens of millions of records. Maybe even hundreds of millions. – Scottie Nov 20 '15 at 22:28
  • What do you mean by `he amount of indexing that you get on a Text column is minimal` – Falmarri Nov 20 '15 at 22:35
  • @Falmarri The way I understand text indexing is that it really can't do much to index a text field and that most of the time a full table scan is required to find the selected rows and return them. But, I'm open to being completely wrong about this... – Scottie Nov 20 '15 at 22:36
  • No that's not generally correct. Mysql can't index TEXT fields (though it can varchar). Postgres can index TEXT fields just fine though – Falmarri Nov 20 '15 at 23:01
  • Awesome! That's what I was looking for. – Scottie Nov 20 '15 at 23:29

2 Answers2

32

When dealing with UUID numbers store them as data type uuid. Always. There is simply no good reason to even consider text as alternative. Input and output is done via text representation by default anyway. The cast is very cheap.

The data type text requires more space in RAM and on disk, is slower to process and more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.

This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:

bigint?

Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. It's range is often underestimated:

-9223372036854775808 to +9223372036854775807

That's 9.2 millions of millions of millions positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.

If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.

UUID is really just for distributed systems and other special cases.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 2
    Well it is for a distributed system with delayed updates, so using guid makes sense. – Scottie Nov 21 '15 at 01:07
  • 3
    @Scottie: That's the use case, yes. – Erwin Brandstetter Nov 21 '15 at 01:10
  • 7
    @ErwinBrandstetter: I didn't arrive at the same conclusion because Scottie said that his DB was already implemented using text GUIDs, and to change it would be a very large undertaking. Given those constraints, I think changing it now would be a premature optimization in the face of other priorities. If starting from scratch, I would definitely go with the native `UUID` type. – khampson Nov 21 '15 at 02:25
  • 1
    @khampson: I see. That makes sense. – Erwin Brandstetter Nov 21 '15 at 03:31
  • upvoted, i want to generate a unique id based on the string title so that no duplicate titles exist, i am generating an MD5 hash of the title and want to convert it to base 64 and store it, UUID type doesnt seem to allow base 64 to be stored, any suggestions – PirateApp Sep 26 '19 at 04:14
  • 2
    @PirateApp: Convert the hex representation of the md5 to type `uuid` directly. This works in Postgres `SELECT md5('foo')::uuid`. See: https://stackoverflow.com/a/8335376/939860 (at the bottom) And: https://dba.stackexchange.com/a/115316/3684 – Erwin Brandstetter Sep 27 '19 at 00:36
21

As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.

What can be said up front is that your indexes will probably larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if to do so would be painful.

In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found it tends to be other factors about a query which tend to result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.

There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).

Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).

One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).

Edit in response to comment from OP:

So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.

Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.

You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.

A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.

So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but is not something that's likely to make a huge difference in this case, at least not usual cases.

I would not optimize up front when to do so would be a significant cost and is likely to never be needed. That bridge can be crossed if that time does come (although I would persue other query optimizations first, as I mentioned above).

Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.

e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.

There's also a module available for generating uuids, uuid-ossp.

Zano
  • 2,595
  • 27
  • 33
khampson
  • 14,700
  • 4
  • 41
  • 43
  • So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique. – Scottie Nov 20 '15 at 22:53