12

Just take a look, please:

WITH toks AS (
  SELECT tok
    FROM
      unnest('{ь, а, чь, ча, чль, чла}'::text[]) AS tok
  ORDER BY tok COLLATE "uk_UA"
)
SELECT ROW_NUMBER() OVER() AS "#", tok FROM toks
ORDER BY tok COLLATE "uk_UA"

PostgreSQL 9.3 (ubuntu) gives me this result:

 # | tok 
---+-----
 1 | а
 2 | ча
 3 | чль
 4 | чла
 5 | чь
 6 | ь
(6 rows)

Here rows 1, 2, 5 and 6 are sorted properly ("ь" goes after "а") while rows 3 and 4 are sorted wrongly ("а" goes after "ь").

All letters are Cyrillic, I've checked so many times.

Please, what's wrong and how to workaround .(

UPDATE: this is a bug which was fixed in mainstream recently: https://sourceware.org/bugzilla/show_bug.cgi?id=17293

UPDATE2: Please note my own answer below.

Cœur
  • 37,241
  • 25
  • 195
  • 267
brownian
  • 435
  • 3
  • 13
  • What is the encoding of your database? Seems that it matters... – Tagir Valeev Jun 30 '15 at 10:39
  • @TagirValeev Encoding: UTF8, Collation: uk_UA.UTF-8 (both as shown by pgadmin3). – brownian Jun 30 '15 at 10:40
  • 3
    PostgreSQL uses the order of the locale from the operating system. If you use the command-line `sort` utility with the same collation and the same data, that should output the same order. – Daniel Vérité Jun 30 '15 at 10:45
  • @DanielVérité yes, `sort` gives somewhat different but wrong result also: `ь > а` but `ть < та`. So, there is somewhere an issue with system's locale? – brownian Jun 30 '15 at 11:06
  • @DanielVérité no, if I `export LANG=uk_UA` --- `sort`ing seems to be ok; there is a problem with default `LANG=uk_UA.UTF-8`. – brownian Jun 30 '15 at 11:17
  • But then what is the encoding? Maybe it's sorting an UTF-8 file with KOI8-U rules, which may sort as intended by happenstance. – Daniel Vérité Jun 30 '15 at 11:21
  • What operating system and version? – Craig Ringer Jun 30 '15 at 11:35
  • @CraigRinger it looks like this bug was fixed 2015-05-26: https://sourceware.org/bugzilla/show_bug.cgi?id=17293 – brownian Jun 30 '15 at 11:39

2 Answers2

7

PostgreSQL relies on the operating system's locale to sort.

See how Ubuntu 14.04 sorts that list:

# locale-gen uk_UA.UTF-8
Generating locales...
  uk_UA.UTF-8... done
Generation complete.

# cat >file
ь
а
чь
ча
чль
чла

# LC_ALL=uk_UA.UTF-8 sort file
а
ча
чль
чла
чь
ь

In the comments you say it's different but what I get here is exactly the same order as your query.

Indeed чль comes before чла which intuitively is weird but I don't know cyrillic.

You may look at /usr/share/i18n/locales/uk_UA for the definition of the locale, and bring it up as an ubuntu bug of the locales packages.

Daniel Vérité
  • 58,074
  • 15
  • 129
  • 156
3

So, the solutions has been completed in these steps:

  1. Searched the Internet and http://linux.org.ua/ for any information; found http://linux.org.ua/cgi-bin/yabb/YaBB.pl?num=1407783417/6#6
  2. Starred at bug report: https://sourceware.org/bugzilla/show_bug.cgi?id=17293, fixed 2015-05-26
  3. Checked glibc version (2.19 now)
  4. Grabbed a patch; edited the patch (removed the section for Makefile)
  5. Backed /usr/share/i18n/locales/uk_UA up
  6. Patched it with [cd /usr/share/i18n/;] patch --dry-run -p2 < locales_uk_UA_softsign.diff --- then with no --dry-run.
  7. locale-gen
  8. service postgresql restart
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
brownian
  • 435
  • 3
  • 13