2

In PostgreSQL 12+ it's possible to use nondeterministic ICU collations in order to implement accent insensitive search (see this StackOverflow answer).

I want to implement accent insensitive search, while still using accent sensitive sorting.


Consider the following example:

I have a table named country_region with a single column: "symbol". It stores the following values:

  • lubuskie
  • łódzkie
  • małopolskie

The values have been listed in the expected order, as the Polish alphabet order is L -> Ł -> M.

The below SQL query should return the "łódzkie" field, ignoring the diacritics:

SELECT * FROM country_region WHERE symbol = 'lodzkie'

The below SQL query should return the values ordered per the alphabet (lubuskie -> łódzkie -> małopolskie):

SELECT * FROM country_region ORDER BY symbol

What I have tried

I have tried using 2 different ICU locales.

In the below collation Unicode locale search works as expected, but rows are sorted in the wrong order:

CREATE COLLATION ignore_accent (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = FALSE);
CREATE INDEX test_index ON country_region(symbol COLLATE ignore_accent);

In the below collation Polish locale sorting works as expected, but diacritics aren't ignored in search:

CREATE COLLATION ignore_accent (provider = icu, locale = 'pl-u-ks-level1-kc-true', deterministic = FALSE);
CREATE INDEX test_index ON country_region(symbol COLLATE ignore_accent);

Is it possible to solve this with PostgreSQL? Thanks in advance!

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563

1 Answers1

1

I'd say that that is an ICU bug in the definition of the Polish collation. A web search in their bug database turned up this, which links to that, which seems to be open, but I am not certain whether that is the problem you encounter or not.

As a workaround, you could define the column using the normal, deterministic collation "pl-PL-x-icu" and be explicit about the collation in equality comparisons:

CREATE COLLATION ignore_accent (
   provider = icu,
   locale = 'und-u-ks-level1-kc-true',
   deterministic = FALSE
);

CREATE INDEX test_index ON country_region(symbol COLLATE ignore_accent);

SELECT * FROM country_region WHERE symbol COLLATE ignore_accent = 'lodzkie';

For all other purposes, the normal collation will do.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263