167

Do I use varchar(36) or are there any better ways to do it?

conny
  • 9,973
  • 6
  • 38
  • 47
CDR
  • 8,198
  • 11
  • 47
  • 46
  • 1
    "thaBadDawg" offers a good answer. There is a parallel thread on Stack Overflow that discusses the topic. I added some comments to that threads answer that link to resources with more detail. Here is the question link: http://stackoverflow.com/questions/547118/storing-mysql-guid-uuids - I expect this topic to become more common when people start considering AWS and Aurora. – Zack Jannsen Feb 04 '16 at 10:58

10 Answers10

120

My DBA asked me when I asked about the best way to store GUIDs for my objects why I needed to store 16 bytes when I could do the same thing in 4 bytes with an Integer. Since he put that challenge out there to me I thought now was a good time to mention it. That being said...

You can store a guid as a CHAR(16) binary if you want to make the most optimal use of storage space.

thaBadDawg
  • 5,160
  • 6
  • 35
  • 44
  • 191
    Because with 16 bytes, you can generate things in different databases, on different machines, at different times, and still merge the data together seamlessly :) – Billy ONeal Sep 02 '10 at 00:12
  • 5
    need reply, what really is a char 16 binary? not char? not binary? I dont see that type in any of the mysql gui tools, nor any documentation in mysql site. @BillyONeal – nawfal Jun 24 '12 at 19:41
  • 3
    @nawfal: Char is the datatype. BINARY is the type specifier against the type. The only effect it has is to modify how MySQL does collation. See http://dev.mysql.com/doc/refman/5.0/en/charset-binary-op.html for more details. Of course you can just use a BINARY type directly if your database editing tool allows you to do that. (Older tools don't know of the binary data type but do know of the binary column flag) – Billy ONeal Jun 25 '12 at 03:48
  • @BillyONeal so I tried `create table p (id char(16))`. So all it did was create a char(16) column with utf8_bin collation. I do not know how to insert a 36 char hexadecimal string (GUID) into it. When I try, I get some illegal characters error. – nawfal Jun 25 '12 at 06:48
  • @BillyONeal moreover do u think char 16 binary can outperform binary 16 performancewise? I doubt – nawfal Jun 25 '12 at 06:48
  • @newfal: In your application you turn that 36 character hexadecimal string into the 16 bytes it actually represents, and store that instead. (You don't put the hexadecimal representation in the database if you go the char(16) route). As far as char(16) binary vs. binary(16), I see no reason why there would be any difference in performance between the two; the only difference would be sorting and you shouldn't sort by GUID for perf-sensitive areas anyway. – Billy ONeal Jun 26 '12 at 19:32
  • 2
    a CHAR and a BINARY field are essentially the same. If you want to take it to the most basic of levels, a CHAR is a binary field expecting a 0 to 255 value with the intention of representing said value with a value mapped from a lookup table (in most cases now, UTF8). A BINARY field expects the same kind of value without any intention of representing said data from a lookup table. I used CHAR(16) back in the 4.x days because back then MySQL wasn't as good as it is now. – thaBadDawg Jun 27 '12 at 16:10
  • @BillyONeal I tried with two query: `CHAR(16) BINARY` and `CHAR(16) CHARACTER SET binary`. The create code I got after executing those were, respectively: `CHAR(16) NULL DEFAULT NULL COLLATE 'utf8_bin'` and `BINARY(16) NULL DEFAULT NULL`. In short, the second query created nothing but a binary field. And it is only in such a field I could store guid. Pure `CHAR(16) utf8_bin` field never helped me to store guids. Or may be I do not know how to insert into such a field! – nawfal Jun 27 '12 at 22:49
  • @newfal: I don't see how it makes any difference. They are both fixed width 16 byte fields, with no restrictions on the data contained inside. – Billy ONeal Jun 29 '12 at 22:34
  • 20
    There are several good reasons why a GUID is far better than a autoincrement. Jeff Atwood lists [these one](http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html). To me, the best advantage in using a GUID is that my app won't need a database roundtrip to know the key of an entity: I could populate it programmatically, which I could not do if I were using an auto-increment field. This saved me from several headaches: with GUID I can manage the entity in the same way, regardless of the entity has been already persisted or it a brand new one. – Arialdo Martini Dec 30 '12 at 09:11
  • @BillyONeal what about collisions? – symbiont Dec 28 '20 at 09:26
  • @ArialdoMartini have you followed the discussion in the comments of that blog entry? i'm not sold on GUIDs. to me, it feels similar to using the same database field, to store multiple values. if you have a problem with website URLs, then it should be solved in the website – symbiont Dec 28 '20 at 09:53
57

I would store it as a char(36).

Brian Fisher
  • 23,519
  • 15
  • 78
  • 82
  • 6
    I can't see why you should store `-`s. – Afshin Mehrabani Jan 30 '17 at 15:30
  • 2
    @AfshinMehrabani It's simple, straightforward, human-readable. It's not necessary, of course, but if storing those extra bytes doesn't hurt then this is the best solution. – user1717828 Jan 04 '18 at 15:21
  • 2
    Storing the dashes might not be a good idea because it will cause more overhead. If you want to make it human readable, make the application read with the dashes. – Lucca Ferri Jul 23 '18 at 02:45
  • @AfshinMehrabani another consideration is parsing it from the database. Most implementations will expect dashes in a valid guid. – Ryan Gates Jun 19 '19 at 21:58
  • You can insert the hyphens when fetching to convert a char(32) to char(36) easily . use the Insert FN of mySql. – joedotnot Jan 25 '20 at 08:13
  • 1
    @joedotnot It's a trade off and depends on how your data is going to be used. Formatting the GUID with dashes isn't free it will add CPU cycles to every query that requires it. Including them in the database will bloat the database itself and add extra CPU cycles to your queries. If your data is going to be write-many/read-few, then store it without the dashes and format as needed. If your data is going to have lots of queries run against it then you might be better off storing your data formatted with dashes. – Nick Fotopoulos Apr 11 '22 at 20:11
33

Adding to the answer by ThaBadDawg, use these handy functions (thanks to a wiser collegue of mine) to get from 36 length string back to a byte array of 16.

DELIMITER $$

CREATE FUNCTION `GuidToBinary`(
    $Data VARCHAR(36)
) RETURNS binary(16)
DETERMINISTIC
NO SQL
BEGIN
    DECLARE $Result BINARY(16) DEFAULT NULL;
    IF $Data IS NOT NULL THEN
        SET $Data = REPLACE($Data,'-','');
        SET $Result =
            CONCAT( UNHEX(SUBSTRING($Data,7,2)), UNHEX(SUBSTRING($Data,5,2)),
                    UNHEX(SUBSTRING($Data,3,2)), UNHEX(SUBSTRING($Data,1,2)),
                    UNHEX(SUBSTRING($Data,11,2)),UNHEX(SUBSTRING($Data,9,2)),
                    UNHEX(SUBSTRING($Data,15,2)),UNHEX(SUBSTRING($Data,13,2)),
                    UNHEX(SUBSTRING($Data,17,16)));
    END IF;
    RETURN $Result;
END

$$

CREATE FUNCTION `ToGuid`(
    $Data BINARY(16)
) RETURNS char(36) CHARSET utf8
DETERMINISTIC
NO SQL
BEGIN
    DECLARE $Result CHAR(36) DEFAULT NULL;
    IF $Data IS NOT NULL THEN
        SET $Result =
            CONCAT(
                HEX(SUBSTRING($Data,4,1)), HEX(SUBSTRING($Data,3,1)),
                HEX(SUBSTRING($Data,2,1)), HEX(SUBSTRING($Data,1,1)), '-', 
                HEX(SUBSTRING($Data,6,1)), HEX(SUBSTRING($Data,5,1)), '-',
                HEX(SUBSTRING($Data,8,1)), HEX(SUBSTRING($Data,7,1)), '-',
                HEX(SUBSTRING($Data,9,2)), '-', HEX(SUBSTRING($Data,11,6)));
    END IF;
    RETURN $Result;
END
$$

CHAR(16) is actually a BINARY(16), choose your preferred flavour

To follow the code better, take the example given the digit-ordered GUID below. (Illegal characters are used for illustrative purposes - each place a unique character.) The functions will transform the byte ordering to achieve a bit order for superior index clustering. The reordered guid is shown below the example.

12345678-9ABC-DEFG-HIJK-LMNOPQRSTUVW
78563412-BC9A-FGDE-HIJK-LMNOPQRSTUVW

Dashes removed:

123456789ABCDEFGHIJKLMNOPQRSTUVW
78563412BC9AFGDEHIJKLMNOPQRSTUVW
MD004
  • 581
  • 1
  • 7
  • 19
KCD
  • 9,873
  • 5
  • 66
  • 75
  • Here's the above GuidToBinary without removing the hyphens from the string: CREATE FUNCTION `GuidToBinary`($guid char(36)) RETURNS binary(16) RETURN CONCAT( UNHEX(SUBSTRING($guid, 7, 2)), UNHEX(SUBSTRING($guid, 5, 2)), UNHEX(SUBSTRING($guid, 3, 2)), UNHEX(SUBSTRING($guid, 1, 2)), UNHEX(SUBSTRING($guid, 12, 2)), UNHEX(SUBSTRING($guid, 10, 2)), UNHEX(SUBSTRING($guid, 17, 2)), UNHEX(SUBSTRING($guid, 15, 2)), UNHEX(SUBSTRING($guid, 20, 4)), UNHEX(SUBSTRING($guid, 25, 12))); – Jonathan Oliver Jan 15 '13 at 14:33
  • 5
    For the curious, these functions are superior to just UNHEX(REPLACE(UUID(),'-','')) because it arranges the bits in an order that will perform better in a clustered index. – Slashterix Feb 02 '14 at 01:28
  • This is very helpful, but I feel it could be improved with a source for `CHAR` and `BINARY` equivalency ([the docs](http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html) seem to imply there are important differences and an explanation of why clustered index performance is better with reordered bytes. – Patrick M Sep 08 '14 at 21:48
  • When I use this my guid is changed. I've tried inserting it using both unhex(replace(string, '-', '')) and the function above and when I convert them back using the same methods the guid that is selected is not the one that was inserted. What is transforming the guid? All I've done is copied the code from above. – Misbit Dec 17 '15 at 14:53
  • @JonathanOliver Could you please share the code for BinaryToGuid() function? – Arun Avanathan Mar 16 '19 at 03:24
  • 1
    Wow too verbose. i'll stick with UNHEX(REPLACE(UUID(),'-','') and nested insert functions insert( insert( insert( insert(HEX(MyBin16Col),9,0,'-'), 14,0,'-'), 19,0,'-'), 24,0,'-') – joedotnot Jan 25 '20 at 08:17
  • Could someone elaborate on the guid reodering part? How do these specific indexes affect index clustering? – Maksim Vi. Feb 06 '23 at 23:51
28

char(36) would be a good choice. Also MySQL's UUID() function can be used which returns a 36-character text format (hex with hyphens) which can be used for retrievals of such IDs from the db.

Learning
  • 8,029
  • 3
  • 35
  • 46
25

"Better" depends on what you're optimizing for.

How much do you care about storage size/performance vs. ease of development? More importantly - are you generating enough GUIDs, or fetching them frequently enough, that it matters?

If the answer is "no", char(36) is more than good enough, and it makes storing/fetching GUIDs dead-simple. Otherwise, binary(16) is reasonable, but you'll have to lean on MySQL and/or your programming language of choice to convert back and forth from the usual string representation.

candu
  • 2,827
  • 23
  • 18
  • 3
    If you host the software (ie a web page for example) and don't sell/install in the client, you can always start with char(36) for easy development in the early stage of the software, and mutate to a more compact format as the system grows in usage and starts needing optimization. – Xavi Montero Sep 13 '14 at 08:27
  • 1
    The biggest down side of the much larger char(36) is how much space the index will take. If you have large number of records in the database, you are doubling the size of the index. – bpeikes Jul 15 '15 at 18:58
8

Binary(16) would be fine, better than use of varchar(32).

Onkar Janwa
  • 3,892
  • 3
  • 31
  • 47
7

The GuidToBinary routine posted by KCD should be tweaked to account for the bit layout of the timestamp in the GUID string. If the string represents a version 1 UUID, like those returned by the uuid() mysql routine, then the time components are embedded in letters 1-G, excluding the D.

12345678-9ABC-DEFG-HIJK-LMNOPQRSTUVW
12345678 = least significant 4 bytes of the timestamp in big endian order
9ABC     = middle 2 timestamp bytes in big endian
D        = 1 to signify a version 1 UUID
EFG      = most significant 12 bits of the timestamp in big endian

When you convert to binary, the best order for indexing would be: EFG9ABC12345678D + the rest.

You don't want to swap 12345678 to 78563412 because big endian already yields the best binary index byte order. However, you do want the most significant bytes moved in front of the lower bytes. Hence, EFG go first, followed by the middle bits and lower bits. Generate a dozen or so UUIDs with uuid() over the course of a minute and you should see how this order yields the correct rank.

select uuid(), 0
union 
select uuid(), sleep(.001)
union 
select uuid(), sleep(.010)
union 
select uuid(), sleep(.100)
union 
select uuid(), sleep(1)
union 
select uuid(), sleep(10)
union
select uuid(), 0;

/* output */
6eec5eb6-9755-11e4-b981-feb7b39d48d6
6eec5f10-9755-11e4-b981-feb7b39d48d6
6eec8ddc-9755-11e4-b981-feb7b39d48d6
6eee30d0-9755-11e4-b981-feb7b39d48d6
6efda038-9755-11e4-b981-feb7b39d48d6
6f9641bf-9755-11e4-b981-feb7b39d48d6
758c3e3e-9755-11e4-b981-feb7b39d48d6 

The first two UUIDs were generated closest in time. They only vary in the last 3 nibbles of the first block. These are the least significant bits of the timestamp, which means we want to push them to the right when we convert this to an indexable byte array. As a counter example, the last ID is the most current, but the KCD's swapping algorithm would put it before the 3rd ID (3e before dc, last bytes from the first block).

The correct order for indexing would be:

1e497556eec5eb6... 
1e497556eec5f10... 
1e497556eec8ddc... 
1e497556eee30d0... 
1e497556efda038... 
1e497556f9641bf... 
1e49755758c3e3e... 

See this article for supporting information: http://mysql.rjweb.org/doc.php/uuid

*** note that I don't split the version nibble from the high 12 bits of the timestamp. This is the D nibble from your example. I just throw it in front. So my binary sequence ends up being DEFG9ABC and so on. This implies that all my indexed UUIDs start with the same nibble. The article does the same thing.

bigh_29
  • 2,529
  • 26
  • 22
5

For those just stumbling across this, there is now a much better alternative as per research by Percona.

It consists of reorganising the UUID chunks for optimal indexing, then converting into binary for reduced storage.

Read the full article here

SleepyCal
  • 5,739
  • 5
  • 33
  • 47
  • I read that article before. I find it very interesting but then how should we perform a query if we want to filter by an ID which is binary? I guess we need to hex again and then apply the criteria. Is it so demanding? Why to store binary(16) (sure it is better than varchar(36)) instead of bigint of 8 bytes? – Maximus Decimus Sep 26 '16 at 21:03
  • 3
    There's an updated article from MariaDB which should answer your question https://mariadb.com/kb/en/mariadb/guiduuid-performance/ – SleepyCal Sep 28 '16 at 14:40
  • fwiw, UUIDv4 is completely random and needs no chunking. – Mahmoud Al-Qudsi Jan 12 '18 at 00:49
2

I would suggest using the functions below since the ones mentioned by @bigh_29 transforms my guids into new ones (for reasons I don't understand). Also, these are a little bit faster in the tests I did on my tables. https://gist.github.com/damienb/159151

DELIMITER |

CREATE FUNCTION uuid_from_bin(b BINARY(16))
RETURNS CHAR(36) DETERMINISTIC
BEGIN
  DECLARE hex CHAR(32);
  SET hex = HEX(b);
  RETURN LOWER(CONCAT(LEFT(hex, 8), '-', MID(hex, 9,4), '-', MID(hex, 13,4), '-', MID(hex, 17,4), '-', RIGHT(hex, 12)));
END
|

CREATE FUNCTION uuid_to_bin(s CHAR(36))
RETURNS BINARY(16) DETERMINISTIC
RETURN UNHEX(CONCAT(LEFT(s, 8), MID(s, 10, 4), MID(s, 15, 4), MID(s, 20, 4), RIGHT(s, 12)))
|

DELIMITER ;
Misbit
  • 347
  • 2
  • 20
-4

if you have a char/varchar value formatted as the standard GUID, you can simply store it as BINARY(16) using the simple CAST(MyString AS BINARY16), without all those mind-boggling sequences of CONCAT + SUBSTR.

BINARY(16) fields are compared/sorted/indexed much faster than strings, and also take two times less space in the database

George Hazan
  • 170
  • 1
  • 9
  • 2
    Running this query shows that CAST converts the uuid string to ASCII bytes: set @a = uuid(); select @a, hex( cast(@a AS BINARY(16))); I get 16f20d98-9760-11e4-b981-feb7b39d48d6 : 3136663230643938 2D 39373630 2D 3131 (spaces added for formatting). 0x31=ascii 1, 0x36=ascii 6. We even get 0x2D, which is the hyphen. This isn't much different than just storing the guid as a string, except that you truncate the string at the 16th character, which cleaves off the part of the ID that is machine specific. – bigh_29 Jan 08 '15 at 18:06
  • Yes, this is simply truncation. `select CAST("hello world, this is as long as uiid" AS BINARY(16));` produces `hello world, thi` – MD004 Feb 22 '17 at 21:30