9

I'm building a small app that includes Esperanto words in my database, so I have words like ĉapelojn and brakhorloĝo, with "special" characters.

Using PostgreSQL 9.4.4 I have a words table with the following schema:

lingvoj_dev=# \d words
                                      Table "public.words"
   Column    |            Type             |                     Modifiers
-------------+-----------------------------+----------------------------------------------------
 id          | integer                     | not null default nextval('words_id_seq'::regclass)
 translated  | character varying(255)      |
 meaning     | character varying(255)      |
 times_seen  | integer                     |
 inserted_at | timestamp without time zone | not null
 updated_at  | timestamp without time zone | not null
Indexes:
    "words_pkey" PRIMARY KEY, btree (id)

But the following query gives some strange output:

lingvoj_dev=# SELECT w."translated" FROM "words" AS w ORDER BY w."translated" desc limit 10; 
translated
------------
 ĉu
 ŝi
 ĝi
 ĉevaloj
 ĉapelojn
 ĉapeloj
 ĉambro
 vostojn
 volas
 viro
(10 rows)

The ordering is inconsistent - I'd be okay with all of the words starting with special characters being at the end, but all of the words starting with ĉ should be grouped together and they're not! Why do ŝi and ĝi come in between ĉu and ĉevaloj?

The server encoding is UTF8, and the collation is en_AU.UTF-8.

edit: It looks like it's sorting all of the special characters as equivalent - it's ordering correctly based on the second character in each word. How do I make PostgreSQL see that ĉ, ŝ and ĝ are not equivalent?

sevenseacat
  • 24,699
  • 6
  • 63
  • 88
  • 1
    Have a look at http://stackoverflow.com/q/18932922/372239 – Toto Sep 18 '15 at 11:27
  • @Toto interesting! I'm on OSX and it doesn't seem to have any kind of Esperanto (EO) encoding - I thought UTF-8 would cover it. I may be out of luck, but I'll keep searching. – sevenseacat Sep 18 '15 at 11:35

1 Answers1

10

I'd be okay with all of the words starting with special characters being at the end...

Use collate "C":

SELECT w."translated" 
FROM "words" AS w 
ORDER BY w."translated" collate "C" desc limit 10; 

See also Different behaviour in “order by” clause: Oracle vs. PostgreSQL

The query can be problematic when using ORM. The solution may be to recreate the database with the LC_COLLATE = C option, as suggested by the OP in the comment. There is one more option - change the collation for a single column:

ALTER TABLE "words" ALTER COLUMN "translated" TYPE text COLLATE "C";
klin
  • 112,967
  • 15
  • 204
  • 232
  • fantastic! Seems to work a treat - I'll put some more data in to double check, then give it a big green tick. – sevenseacat Sep 18 '15 at 13:49
  • another option (which is more compatible with the ORM I was using) is to dump the database, recreate it with the option LC_COLLATE=C, and then re-import the data. – sevenseacat Sep 21 '15 at 09:41