ORDER BY with diacritic in Postgres

Question

I need to select data from table and sort them with ORDER BY clause. The problem is the column contains text data with czech diacritic. I cannot use COLLATE, because the DB is part of postgres cluster which was created with lc_collate = en_US.UTF-8 and I cannot afford downtime caused by recreating the cluster with correct lc_collate.

Sample data:

CREATE TABLE test (
  id serial PRIMARY key,
  name text
);

INSERT INTO test (name) VALUES ('Žoo'), ('Zoo'), ('ŽOO'), ('ZOO'),
  ('ŽoA'), ('ŽóA'), ('ŽoÁ'), ('ŽóÁ');

Ideal output:

SELECT * FROM test ORDER BY name COLLATE "cs_CZ.utf8";
id | name
----+------
  2 | Zoo
  4 | ZOO
  5 | ŽoA
  7 | ŽoÁ
  6 | ŽóA
  8 | ŽóÁ
  1 | Žoo
  3 | ŽOO
(8 rows)

Here I found kind of solution:

SELECT * FROM test ORDER BY name USING ~<~;
id | name
----+------
  4 | ZOO
  2 | Zoo
  3 | ŽOO
  5 | ŽoA
  1 | Žoo
  7 | ŽoÁ
  6 | ŽóA
  8 | ŽóÁ
(8 rows)

The result is close enough (for my usage) - the caroned letters are AFTER the non-caroned.

My slightly off-topic Postgresql anabasis with ~<~ operator

edit: turned into new question.

Back to the question: Is there other solution to get the ideal order besides recreating the postgres cluster with correct locale?

Also some insght on the ~<~ operator would be nice.

BTW. You do not need downtime to use collation as in `ORDER BY name COLLATE "cs_CZ.utf8"`. You do not need downtime to add additional collation to existing postgres cluster: http://www.postgresql.org/docs/current/static/sql-createcollation.html — Ihor Romanchenko, Feb 25 '16 at 14:22
Thank you for pointing that out. The downtime is unnecessary, all I needed was to add new locale on the server by adding line `cs_CZ.UTF-8 UTF-8` to `/etc/locale.gen` and regenerating locale with `locale-gen`. Then in Postgres: `CREATE COLLATION "cs_CZ.utf8" ( locale = 'cs_CZ.UTF-8' );` — Vojtech Sokol, Mar 04 '16 at 23:29

score 0 · Answer 1 · answered Feb 25 '16 at 14:13

Im not sure if I understand the question, because look like you already found a solution. The only think I can suggest is you can add a new field czechName with the correct collate

http://www.postgresql.org/docs/current/static/sql-altertable.html

ADD [ COLUMN ] column_name data_type [ COLLATE collation ] [ column_constraint [ ... ] ]

score 0 · Answer 2 · answered Feb 25 '16 at 14:23

0

Similar to what Juan Carlos Oropeza suggests, you can try to change the collation of the column with:

ALTER TABLE test ALTER COLUMN "name" TYPE text COLLATE 'cs_CZ.utf8';

Reference: http://www.postgresql.org/docs/current/static/sql-altertable.html

answered Feb 25 '16 at 14:23

Diego Augusto Molina

231
2
7

score 0 · Accepted Answer · answered Apr 19 '16 at 20:58

As @Igor pointed out in his comment, there is no need to recreate the postgres cluster with different lc_collate and deal with the caused downtime.

Exact steps that solved the problem were:

add/uncomment line cs_CZ.UTF-8 UTF-8 in /etc/locale.gen
generate the new locale:

# locale-gen
define new collation in postgres:

CREATE COLLATION "cs_CZ.utf8" ( locale = 'cs_CZ.UTF-8' );

ORDER BY with diacritic in Postgres

3 Answers3

Linked