5

I have a data table and I would like to filter the columns. For example, to search for a user by his fullname.

However, I would like to allow the user to enter the search phrase case-independent and accents-independent.

So I have checked these (and more) sources and questions:

I thought the nondeterministic collations might finally be the right way how to achieve that, but unfortunately I don't know how:

  • combine case_insensitive and ignore_accents into one collation
  • how to allow searching only by substring in such a WHERE (e.g., find "Jóhn Doe" only by the string "joh") as nondeterministic collations do not support LIKE or regular expressions
  • which index to use

I would be very grateful for any advice on how to finally deal with this type of problem.

Thanks!

Jeffery Grajkowski
  • 3,982
  • 20
  • 23
Lunack
  • 343
  • 1
  • 5
  • 14

1 Answers1

5

Creating case and accent insensitive ICU collations is pretty simple:

CREATE COLLATION english_ci_ai (
   PROVIDER = icu,
   DETERMINISTIC = FALSE,
   LOCALE = "en-US-u-ks-level1"
);

Or, equivalently (that syntax also works wil old ICU versions:

CREATE COLLATION english_ci_ai (
   PROVIDER = icu,
   DETERMINISTIC = FALSE,
   LOCALE = "en-US@colStrength=primary"
);

See the ICU documentation for details and my article for a detailed discussion.

But your problem is that you want substring search. So you should create a trigram index:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE INDEX ON tab USING gin (unaccent(doc) gin_trgm_ops);

Then you can search like this:

SELECT * FROM tab
WHERE unaccent(doc) ILIKE unaccent('%joh%');

Note that you have to force a minimal length of 4 or so on the search string if you want that to be efficient.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • Thank you, your article helped me to understand how creating a custom ICU collation works. However, unfortunately my problem with not being able to search with a non-deterministic ICU according to a portion of the string persists. – Lunack Mar 29 '22 at 15:10
  • 2
    Thanks for the update, I thought I'd have to stick with this unaccent solution. I just have to create an IMMUTABLE version of unaccent as mentioned in my first link. Thank you! – Lunack Mar 29 '22 at 18:56