2

For my database I want natural ('A2'<'A12'), case-insensitive sorting, so for testing I created a new collation with

CREATE COLLATION tomscollation (provider = icu, locale = 'de-u-kn-true-ks-level2'); enter image description here

My end goal is to use such a collation as the default collation for the whole database so that every text column or index uses it by default.

This source notes that for PostgreSQL 15 non-deterministic collations are not yet supported at database level. This is where my confusion starts. Looking at my collation the database says its deterministic (see screenshot). However the documentation says that only byte-order collations can be deterministic but mine is clearly not byte-order.

So what am I missing here? Why am I able to create the following database and what are the side effects I might be unaware of?

CREATE DATABASE tomsdb TEMPLATE template0 LOCALE_PROVIDER icu ICU_LOCALE 'de-u-kn-true-ks-level2'

Tom
  • 503
  • 2
  • 13

2 Answers2

2

kn=true means that Unicode Locale Extension numeric ordering will be default.

"non-deterministic collations are not yet supported at database level." means that by default, select 'a' = 'A'; will return false.

CREATE DATABASE tomsdb TEMPLATE template0 
    LOCALE_PROVIDER icu
    ICU_LOCALE 'de-u-kn-true-ks-level2'
    LOCALE 'de_DE.utf8'
    ;

select 'a1' = 'A1'
union all
select 'a2' > 'A12'
union all
select 'A2' > 'A12';

it return:

 ?column?
----------
 f
 f
 f
(3 rows)

so on a database level, kn=true did work as expected. but ks-level2 should by default make select 'a1' = 'A1' return true.

so you need create collation to do case insensitive sorting, like following way:

CREATE COLLATION case_insensitive (
    provider = icu, locale = 'und-u-ks-level2', deterministic = false);

select 'a1' = 'A1' COLLATE case_insensitive;
jian
  • 4,119
  • 1
  • 17
  • 32
1

de-u-kn-true-ks-level2 is by definition not deterministic. If you use a non-deterministic collation in a context where you need a deterministic one (like as a database collation, or if you defined the collation as DETERMINISTIC = TRUE), PostgreSQL will break the tie by using memcmp if the two strings compare equal, but are not identical.

That means for example that if you use that collation for a database, you will get 'A1' < 'a1', because memcmp ranks the lower ASCII value of A before the higher one of a.

See the code of varstr_cmp() in src/backend/utils/adt/varlena.c:

/* Break tie if necessary. */
if (result == 0 &&
    (!mylocale || mylocale->deterministic))
    result = strcmp(a1p, a2p);

Since you asked in your comment: if you mark a collation as DETERMINISTIC, strings that are byte-wise different will never compare as equal.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • Thanks but the C code is too much for me. So if I use a non-deterministic collation when creating a database PostgreSQL will use some kind of fallback? Can you add one sentence in easier terms what memcmp does? Also what is the use of 'DETERMINISTIC = TRUE' when creating a collation then? If I can create whatever I want. – Tom Feb 28 '23 at 23:08
  • I reduced the C code to the essential part, added an explanatory sentence about `memcmp` and defined `DETERMINISTIC`. Better? – Laurenz Albe Mar 01 '23 at 06:44
  • Hm, just for my understanding: "de-u-kn-true-ks-level2" for "CREATE DATABASE ..." will in any case treat the collation as deterministic as CREATE DATABASE doesnt even have the concept of non-deterministic collations and "CREATE COLLATION..." with deterministic=true (the default) will also treat it deterministic, so wouldnt it be easier to understand to say that as of PostgreSQL 15 all database collations are deterministic and all "CREATE COLLATION ..." with deterministic=true are deterministic while with deterministic=false they _might_ be non-deterministic? – Tom Mar 19 '23 at 00:21
  • "...marking a collation as DETERMINISTIC allows it to compare strings that are different as equal." Dont you mean DETERMINISTIC=false allows it to compare string that are different as equal? – Tom Mar 19 '23 at 14:23
  • Sorry for another question I would be grateful if you could help me understand: For collation 'de-u-kn-true-ks-level2' and DETERMINISTIC=true why do some sort rules of the german language work and kn-true also works but _only_ for ks-level2 the "tie breaker" byte order has to come in? – Tom Mar 19 '23 at 22:04
  • "Yes" to your first comment. To your second: I fixed my addition; it was wrong. I don't understand your final comment. – Laurenz Albe Mar 20 '23 at 02:32
  • Thanks for your answer! With that comment I want to ask when exactly the byte order "tie breaker" is used or to say it differently when does the DETERMINISTIC=true make a difference? Only certain functions e.g. when using 'LIKE' or '='? Does it matter for 'ORDER BY'? etc... So "CREATE COLLATION tomscollation (provider = icu, locale = 'de-u-kn-true-ks-level2', deterministic = true);" is deterministic because of the deterministic=true, right? Minor edit suggestions: "if you mark a collation as DETERMINISTIC=true, strings that are different byte wise will never compare as equal. – Tom Mar 20 '23 at 10:14
  • "Deterministic" is respected everywhere except in a database collation. – Laurenz Albe Mar 20 '23 at 11:30