Postgres ordering of UTF-8 characters

Question

I'm building a small app that includes Esperanto words in my database, so I have words like ĉapelojn and brakhorloĝo, with "special" characters.

Using PostgreSQL 9.4.4 I have a words table with the following schema:

lingvoj_dev=# \d words
                                      Table "public.words"
   Column    |            Type             |                     Modifiers
-------------+-----------------------------+----------------------------------------------------
 id          | integer                     | not null default nextval('words_id_seq'::regclass)
 translated  | character varying(255)      |
 meaning     | character varying(255)      |
 times_seen  | integer                     |
 inserted_at | timestamp without time zone | not null
 updated_at  | timestamp without time zone | not null
Indexes:
    "words_pkey" PRIMARY KEY, btree (id)

But the following query gives some strange output:

lingvoj_dev=# SELECT w."translated" FROM "words" AS w ORDER BY w."translated" desc limit 10; 
translated
------------
 ĉu
 ŝi
 ĝi
 ĉevaloj
 ĉapelojn
 ĉapeloj
 ĉambro
 vostojn
 volas
 viro
(10 rows)

The ordering is inconsistent - I'd be okay with all of the words starting with special characters being at the end, but all of the words starting with ĉ should be grouped together and they're not! Why do ŝi and ĝi come in between ĉu and ĉevaloj?

The server encoding is UTF8, and the collation is en_AU.UTF-8.

edit: It looks like it's sorting all of the special characters as equivalent - it's ordering correctly based on the second character in each word. How do I make PostgreSQL see that ĉ, ŝ and ĝ are not equivalent?

@Toto interesting! I'm on OSX and it doesn't seem to have any kind of Esperanto (EO) encoding - I thought UTF-8 would cover it. I may be out of luck, but I'll keep searching. — sevenseacat, Sep 18 '15 at 11:35

klin · Accepted Answer · 2021-09-10T16:48:53.947

10

I'd be okay with all of the words starting with special characters being at the end...

Use collate "C":

SELECT w."translated" 
FROM "words" AS w 
ORDER BY w."translated" collate "C" desc limit 10;

See also Different behaviour in “order by” clause: Oracle vs. PostgreSQL

The query can be problematic when using ORM. The solution may be to recreate the database with the LC_COLLATE = C option, as suggested by the OP in the comment. There is one more option - change the collation for a single column:

ALTER TABLE "words" ALTER COLUMN "translated" TYPE text COLLATE "C";

edited Sep 10 '21 at 16:48

answered Sep 18 '15 at 13:19

klin

112,967
15
204
232

fantastic! Seems to work a treat - I'll put some more data in to double check, then give it a big green tick. – sevenseacat Sep 18 '15 at 13:49
another option (which is more compatible with the ORM I was using) is to dump the database, recreate it with the option LC_COLLATE=C, and then re-import the data. – sevenseacat Sep 21 '15 at 09:41

Postgres ordering of UTF-8 characters

1 Answers1

Linked