10

Right now I can choose Encoding : UTF8 when creating a new DB in pgAdmin4 GUI.

But, there is no option to choose utf8_general_ci as collation or character type. When I do select * from pg_collation; I dont see any collation relevant to utf8_general_ci.

Coming from a mySQL background I am confused. Do I have to install utf8-like ( eg utf8_general_ci, utf8_unicode_ci) collation in my PostgreSQL 10 or windows10?

I just want to have the equivalent of mySQL collation utf8_general_ci to PostgreSQL.

Thank you

slevin
  • 4,166
  • 20
  • 69
  • 129
  • 1
    Note to readers: Postgres gained support for [ICU](https://en.wikipedia.org/wiki/International_Components_for_Unicode) collations in version 10 and later. See [post](https://www.2ndquadrant.com/en/blog/icu-support-postgresql-10/) by Peter Eisentraut. – Basil Bourque Jul 16 '19 at 21:05

1 Answers1

11

utf8 is an encoding (how to represent unicode characters as a series of bytes), not a collation (which character goes before which).

I think the Postgres 10 collation equivalent for utf8_general_ci (or more modern utf8_unicode_ci) is called und-x-icu - this is an undefined collation (not defined for any real world language) provided by an ICU library. This collation would sort quite reasonably characters from most languages.

ICU support is a new feature added in PostgreSQL 10, so this collation isn't available for older PostgreSQL versions or when it's disabled during compilation. Before that Postgres was using operating system provided collation support, which differs between operating systems.

Basil Bourque
  • 303,325
  • 100
  • 852
  • 1,154
Tometzky
  • 22,573
  • 5
  • 59
  • 73
  • Happy New Year. Yes I know the difference between encoding-collation. But thinking in mySQL terms, I meant any utf8-like collation (eg `utf8_general_ci` or `utf8_unicode_ci`). I edited my OP to be better written. Anyway, if `und-x-icu` is an undefined collation , then I could also use C collation, that is also an undefined collation , right? Thank you – slevin Jan 01 '18 at 11:11
  • 2
    I made some tests and it seems that in current version (10.1) ICU support is not as complete as I assumed. It seems that it's impossible to create a database with `und-x-icu` locale - only set it for some columns explicitly. So it seems that until it's improved you'd probably have to use `create database datname template='template0' encoding='UTF8' lc_ctype='C' lc_collate='C';` or other OS-provided locale. – Tometzky Jan 02 '18 at 00:24
  • 3
    Hi again. I have PostgreSQL 10.1, compiled by Visual C++ build 1800, 64-bit in win10 laptop. I just did `CREATE DATABASE test WITH OWNER = postgres TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'und-x-icu' LC_CTYPE = 'und-x-icu' TABLESPACE = pg_default CONNECTION LIMIT = -1;` and it worked. I can also see the properties in the pgAdmin4 GUI. – slevin Jan 02 '18 at 14:10
  • Also, I have to thank you because I didnt know about the ICU, you inform about it and it looks awesome , compared to C. It is not OS-depended, it is future proof, is based on Unicode standard for sorting and comparison (much like utf8_unicode_ci) , it implements the CLDR Collation Algorithm which is an extension of the Unicode Collation Algorithm (UCA) which is an extension of ISO 14651. [here](https://www.postgresql.org/docs/current/static/collation.html) , [here](https://blog.2ndquadrant.com/icu-support-postgresql-10/) and [here](http://userguide.icu-project.org/collation) – slevin Jan 02 '18 at 14:16
  • But, if you want, you can post a regular answer explaining the bugs you found. – slevin Jan 02 '18 at 14:18
  • 1
    Hello, testing on PG 11.1, Win10 x64 `CREATE DATABASE dbname WITH OWNER = postgres TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'und-x-icu' LC_CTYPE = 'und-x-icu' TABLESPACE = pg_default CONNECTION LIMIT = -1; COMMENT ON DATABASE dbname IS '...';` and it works. – XCanG Nov 28 '18 at 09:50
  • `und-x-icu` seems to be supported only with 13+. I tried to use it and then `ILIKE` queries were case-*sensitive*. Using default collation made `ILIKE` search case-insensitive as expected. – akostadinov Dec 23 '21 at 13:26