Postgres UNACCENT for character with more than 1 diacritic

Question

The UNACCENT function can strip diacritics off characters. However, in my case, it can only strip characters with 1 diacritic, for e.g.

Thành
Supermän
äää

For characters with more than 1 diacritics, UNACCENT does nothing, for e.g.

Hồ
ầ
phố

Is there a way to let Postgres strip the accents from these characters?

Thanks

score 4 · Accepted Answer · answered Apr 08 '17 at 11:30

4

PostgreSQL's unaccent module does not use Unicode normalization, but only a simple search-and-replace dictionary. The default dictionary, unaccent.rules, does not contain these Vietnamese characters, thus nothing is done.

You could create your own unaccent dictionary though. As explained in the documentation:

Create a text file vietnamese.rules with content like
```
ầ  a
Ầ  A
ồ  o
Ồ  O
```
Move vietnamese.rules into the folder $SHAREDIR/tsearch_data/ (usually /usr/share/postgresql/tsearch_data)

Run the function as

SELECT unaccent('vietnamese', 'Hồ ầ phố');
--              ^~~~~~~~~~~~~

answered Apr 08 '17 at 11:30

kennytm

510,854
105
1,084
1,005

I was afraid to create my own rule since I don't have access to the DB computer, but I guess there is no avoiding it. – Khanetor Apr 08 '17 at 12:23
I'm sorry, but I got this error ` ERROR: text search dictionary "vietnamese" does not exist LINE 1: SELECT unaccent('vietnamese', 'Hồ ầ phố'); ` – Hom nom nom nom ... Jan 21 '19 at 13:23
@Homnomnomnom... Just submitted an edit addressing your problem. You need to run the appropriate `CREATE TEXT SEARCH DICTIONARY` command. – cstork Nov 13 '21 at 14:30

score 3 · Answer 2 · answered Dec 09 '19 at 05:29

You can create a new function to do unaccent job like below:

CREATE OR REPLACE FUNCTION public.vietnamese_unaccent(text)
 RETURNS text
 LANGUAGE plpgsql
AS $function$
DECLARE
    input_string text := $1;
BEGIN

input_string := translate(input_string, 'áàãạảAÁÀÃẠẢăắằẵặẳĂẮẰẴẶẲâầấẫậẩÂẤẦẪẬẨ', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');
input_string := translate(input_string, 'éèẽẹẻEÉÈẼẸẺêếềễệểÊẾỀỄỆỂ', 'eeeeeeeeeeeeeeeeeeeeeeee');
input_string := translate(input_string, 'íìĩịỉIÍÌĨỊỈ', 'iiiiiiiiiii');
input_string := translate(input_string, 'óòõọỏOÓÒÕỌỎôốồỗộổÔỐỒỖỘỔơớờỡợởƠỚỜỠỢỞ', 'ooooooooooooooooooooooooooooooooooo');
input_string := translate(input_string, 'úùũụủUÚÙŨỤỦưứừữựửƯỨỪỮỰỬ', 'uuuuuuuuuuuuuuuuuuuuuuu');
input_string := translate(input_string, 'ýỳỹỵỷYÝỲỸỴỶ', 'yyyyyyyyyyy');
input_string := translate(input_string, 'dđĐD', 'dddd');

return input_string;
END;
$function$

It's works for me!

score 0 · Answer 3 · answered Jul 24 '21 at 07:32

If you use from version 13 onwards, the Postgresql already support that function here:

select normalize('hồ, phố, ầ', NFC) → 'ho, pho, a' -- NFC (the default), NFD, NFKC, or NFKD.

Document: https://www.postgresql.org/docs/13/functions-string.html

Postgres UNACCENT for character with more than 1 diacritic

3 Answers3