I would like a collation that orders the UTF-8 encoding of 0x1234 below that of 0x1235, regardless of the character mapping in the Unicode standard. MySQL uses utf8_bin for this. MSSQL apparently has BIN and BIN2 collations (http://msdn.microsoft.com/en-us/library/ms143350.aspx). While finding these was easy, I can't even find a list of the collations PostgreSQL supports, much less an answer to this specific question.
3 Answers
The C locale will do. UTF-8 is designed so that byte ordering is also code point ordering. This is not obvious, but consider how UTF-8 works:
| Number range | Byte 1 | Byte 2 | Byte 3 |
|---|---|---|---|
| 0000-007F | 0xxxxxxx | | |
| 0080-07FF | 110xxxxx | 10xxxxxx | |
| 0800-FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx |
When sorting binary data, as the C locale does, the first non-equal byte determines the ordering. What we need to see is that if two numbers encoded into UTF-8 differ, then the first non-equal byte will be lower for the lower value. If the numbers are in different ranges, the first byte will indeed be lower for the lower number. Within the same range, the order is determined by literally the same bits as without encoding.
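The argument can also be checked empirically. The following is only an illustrative sketch in Python, not PostgreSQL code: the helper names `sample_code_points` and `utf8_order_matches_code_point_order` are made up for this example. It samples code points across all encoding lengths (including the 4-byte range beyond FFFF, and skipping surrogates, which cannot be encoded in UTF-8) and compares a sort by encoded bytes with a sort by code point.

```python
import random


def sample_code_points(n, seed=0):
    """Return n distinct valid code points (surrogates excluded)."""
    rng = random.Random(seed)
    points = set()
    while len(points) < n:
        cp = rng.randrange(0x110000)
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        points.add(cp)
    return points


def utf8_order_matches_code_point_order(code_points):
    """Compare a sort by code point with a sort by UTF-8 byte string."""
    by_code_point = sorted(code_points)
    # Python compares bytes objects lexicographically as unsigned bytes,
    # which is the same comparison a binary ("C") sort performs.
    by_utf8_bytes = sorted(code_points, key=lambda cp: chr(cp).encode("utf-8"))
    return by_code_point == by_utf8_bytes


# Random sample plus the boundaries between encoding lengths
# and the two values from the question.
points = sample_code_points(5000) | {
    0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x1234, 0x1235,
}

print(utf8_order_matches_code_point_order(points))  # prints True
print(chr(0x1234).encode("utf-8").hex())  # prints e188b4
print(chr(0x1235).encode("utf-8").hex())  # prints e188b5
```

A random sample is of course not a proof, but it covers every range in the table and the boundaries between them.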

That's a code point sort, which is completely useless on Unicode. How do you make it do a proper alphabetic sort using the sort algorithm required by Unicode in its Unicode Collation Algorithm? – tchrist Oct 15 '11 at 19:37
@tchrist: That is not the question. – Peter Eisentraut Oct 15 '11 at 20:54
Sort order of text depends on `lc_collate` (not on the system locale!). The system locale only serves as a default when creating the db cluster if you don't provide another locale.
The behaviour you are expecting only works with locale `C`. Read all about it in the fine manual:
The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.
Emphasis mine. PostgreSQL 9.1 has a couple of new features for collation. Might be exactly what you are looking for.

How do you make it do an alphabetic sort instead of a code point sort? You know, so that it uses the Unicode Collation Algorithm. Otherwise you will never get an alphabetic sort on Unicode text. – tchrist Oct 15 '11 at 19:36
@tchrist: Normally you have `lc_collate` set to your locale. Example: in England, you would probably have `lc_collate` set to `en_GB.utf8`. Try `SHOW lc_collate;` to see your setting. Follow the link in my answer for more information. – Erwin Brandstetter Oct 15 '11 at 22:46
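Postgres uses the collation defined by the system locale on cluster creation.
You might try `ORDER BY encode(convert_to(column, 'UTF8'), 'hex')`. Note that `encode` takes a `bytea` argument, so a text column has to be converted to bytes first.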
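The reason a hex workaround can work is that hex encoding preserves byte order: the digits `0`-`9` sort before `a`-`f`, so comparing the hex strings gives the same result as comparing the underlying bytes. Below is a small sketch of that idea in Python, not PostgreSQL code. The helper `utf8_hex` and the sample list are made up for illustration, and it assumes the hex strings themselves are compared character by character with digits ordered before letters.

```python
def utf8_hex(s):
    """Lowercase hex dump of the UTF-8 encoding of s."""
    return s.encode("utf-8").hex()


words = ["b", "a", "\u00e9", "Z", "\u1235", "\u1234", "ab", "\U0001F600"]

# Sort once by the raw UTF-8 bytes and once by their hex representation.
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_hex = sorted(words, key=utf8_hex)

print(by_bytes == by_hex)  # prints True
print([utf8_hex(w) for w in by_hex])
```

Sorting by the hex string doubles the size of every sort key, so where a binary collation is available it is likely the cheaper choice.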