
I have a PostgreSQL 9.1 database (locale "en_US.UTF-8") with this table:

CREATE TABLE branch_language
(
    id serial NOT NULL,
    name_language character varying(128) NOT NULL,
    branch_id integer NOT NULL,
    language_id integer NOT NULL,
    ....
)

The attribute name_language contains names in various languages. The language is specified by the foreign key language_id.

I have created a few indexes:

/* us english */
CREATE INDEX idx_branch_language_2
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."en_US" );

/* catalan */
CREATE INDEX idx_branch_language_5
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."ca_ES" );

/* portuguese */
CREATE INDEX idx_branch_language_6
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."pt_PT" );

Now when I run a SELECT, I am not getting the results I expect.

select name_language from branch_language
where language_id=42 -- id of catalan language
order by name_language collate "ca_ES" -- use ca_ES collation

This generates a list of names but not in the order I expected:

Aficions i Joguines
Agència de viatges
Aliments i Subministraments
Aparells elèctrics i il·luminació
Art i Antiguitats
Articles de la llar
Bars i Restaurants
...
Tabac
Àudio, Vídeo, CD i DVD
Òptica

I expected the last two entries to be sorted in among the other names (under A and O), not after "Tabac".

Creating the indexes works fine. I don't think they are strictly necessary, though, unless you want to optimize for performance.

The SELECT statement, however, seems to ignore the COLLATE "ca_ES" clause.

The problem also occurs with other collations. I have tried "es_ES" and "pt_PT", but the results are similarly wrong.

Henri

2 Answers


I can't find a flaw in your design. I have tried.

Locales and collation

I revisited this question. Consider this test case on sqlfiddle. It seems to work just fine. I even created the locale ca_ES.utf8 in my local test server (PostgreSQL 9.1.6 on Debian Squeeze) and added the locale to my DB cluster:

CREATE COLLATION "ca_ES" (LOCALE = 'ca_ES.utf8');

I get the same results as can be seen in the sqlfiddle above.

Note that collation names are identifiers and need to be double-quoted to preserve CamelCase spelling like "ca_ES". Maybe there has been some confusion with other locales in your system? Check your available collations:

SELECT * FROM pg_collation;

Generally, collation rules are derived from system locales. Read about the details in the manual here. If you still get incorrect results, I would try to update your system and regenerate the locale for "ca_ES". In Debian (and related Linux distributions) this can be done with:

dpkg-reconfigure locales

NFC

I have one other idea: unnormalized Unicode strings.

Could it be that your 'Àudio' is in fact 'A' || '̀' || 'udio', i.e. a base letter followed by the combining grave accent? That would be this character:

SELECT U&'\0300';         -- the combining grave accent, U+0300
SELECT ascii(U&'\0300');  -- returns 768
SELECT chr(768);          -- the same character again

Read more about the combining grave accent (U+0300) on Wikipedia.
Note that standard_conforming_strings must be on to use Unicode escape strings like in the first line.

Note that some browsers cannot display unnormalized Unicode characters correctly, and many fonts have no proper glyph for the special characters, so you may see nothing here or gibberish. But Unicode allows for that nonsense. Test to see what you got:

SELECT octet_length(U&'A\0300');  -- decomposed: returns 3 (!)
SELECT octet_length('À');         -- precomposed: returns 2
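The same check can be made outside the database. A minimal Python sketch (the sample strings are hypothetical; unicodedata is in the standard library) showing that the decomposed and precomposed forms look alike but differ at the byte level, and that NFC normalization reconciles them:

```python
import unicodedata

decomposed = "A\u0300udio"   # 'A' + combining grave accent (U+0300) + 'udio'
precomposed = "\u00C0udio"   # precomposed 'À' (U+00C0) + 'udio'

# The two render identically but are different code point sequences:
print(decomposed == precomposed)              # False
print(len(decomposed.encode("utf-8")))        # 7 bytes
print(len(precomposed.encode("utf-8")))       # 6 bytes

# NFC normalization composes 'A' + U+0300 into the single code point U+00C0:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```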

If that's what your database has contracted, you need to get rid of it or suffer the consequences. The cure is to normalize your strings to NFC. Perl has superior Unicode-fu; you can make use of its libraries in a plperlu function to do this inside PostgreSQL. I have done that to save myself from madness.
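A minimal sketch of such a function, assuming the plperlu language and Perl's Unicode::Normalize module are installed (the function name nfc is my choice, not a built-in):

```sql
CREATE OR REPLACE FUNCTION nfc(text) RETURNS text AS $$
    use Unicode::Normalize;
    return NFC(shift);       -- compose to Normalization Form C
$$ LANGUAGE plperlu IMMUTABLE STRICT;

-- one-time cleanup of existing rows:
UPDATE branch_language
SET    name_language = nfc(name_language)
WHERE  name_language IS DISTINCT FROM nfc(name_language);
```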

Read installation instructions in this excellent article about UNICODE normalization in PostgreSQL by David Wheeler.
Read all the gory details about Unicode Normalization Forms at unicode.org.

Erwin Brandstetter
  • I checked the first character from "Àudio, Vídeo, CD i DVD", resulting in: select octet_length('À') returning 2. The same for "Òptica", select octet_length('Ò') resulting in 2. – Henri Oct 20 '11 at 12:55
  • In the file postgresql.conf under VERSION/PLATFORM COMPATIBILITY # - Previous PostgreSQL Versions - it reads #standard_conforming_strings = on – Henri Oct 20 '11 at 13:13
  • I thought and hoped it would be similar to http://download.oracle.com/docs/cd/E17952_01/refman-5.0-en/charset-collate.html – Henri Oct 20 '11 at 13:18
  • @Henri: Sorry, my idea was a bit of a shot in the dark. No luck. I am out of ideas. For all I know, your setup should work. – Erwin Brandstetter Oct 20 '11 at 14:28
  • Your remarks gave me some ideas. No solution to my problem, but a slight improvement to my application anyhow. This database is used in a Django application. I don't have plperl installed in PostgreSQL, so I can't use David Wheeler's setup, but I have changed the code in the models (ORM) so all Unicode strings are stored NFC-normalized. At least some improvement... :-) – Henri Oct 20 '11 at 15:17
  • @Henri: Glad it didn't all go to waste. ;) – Erwin Brandstetter Oct 20 '11 at 16:23

The problem is with accentuation. You have to use an AI (accent-insensitive) collation. Check how to do this in PostgreSQL; in some DBMSs it is called something like ca_ES_AI.
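PostgreSQL has no built-in AI collations, but the unaccent contrib module can approximate one. A sketch, assuming the extension is available on the server (PostgreSQL 9.1+):

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT name_language
FROM   branch_language
WHERE  language_id = 42
ORDER  BY unaccent(name_language);
```

Note that this strips diacritics before comparing, which changes the sort semantics rather than applying locale rules; for the original question, fixing the system locale (as in the first answer) is the more direct approach.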