
I'm looking for a way to remove diacritics and other letter marks from a text, simplifying it so that it is a good fit for a text search index.

For removing the diacritics, I already found these:

  • questions for PHP: 1, 2
  • question for Java: 1, related: 2
  • question for Bash: 1
  • questions for .NET: 1, 2
  • question for JavaScript: 1
  • question for Python: 1

I am wondering whether there is a generic, language-independent solution. (Also, this reference list might be useful for some.)

Removing the diacritics works for äöüò, etc. But I also want:

  • ø → o
  • Я → R
  • Ł → L
  • ɲ → n
  • æ → a (it could also be "ae" but in my case, "a" makes more sense because I also want to replace "ae" by "a")

For example, I want to index the name Røyksopp, which sometimes also appears as Röyksopp, only under the simplified name Royksopp. Likewise, KoЯn should become KoRn.

Albert

2 Answers


Some ICU magic:

echo "ë ö ø Я Ł ɲ æ å ñ 開 당" | uconv -x any-name | perl -wpne 's/ WITH [^}]+//g;' | uconv -x name-any | uconv -x any-latin -t iso-8859-1 -c | uconv -f iso-8859-1 -t ascii -x latin-ascii -c

yields

e o o A L n ae a n ki dang

This uses the command-line tool uconv, but the same can be done with ICU's Java, C, or C++ API, and ICU has bindings for almost any language.
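
The name-based part of the pipeline (convert each character to its Unicode name, drop everything from " WITH " onward, convert the name back to a character) can also be approximated without ICU. The following is only a rough sketch using Python's standard `unicodedata` module instead of uconv; the function names are made up for illustration, and it does not perform the later any-latin or latin-ascii transliteration steps, so æ, Я, 開, and 당 pass through unchanged.

```python
import re
import unicodedata


def strip_with_suffix(ch):
    """Return the character named by dropping ' WITH ...' from ch's Unicode name."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return ch  # character has no name (e.g. control characters)
    base_name = re.sub(r" WITH .+$", "", name)
    if base_name == name:
        return ch  # nothing to strip
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        return ch  # the shortened name does not identify a character


def simplify_by_name(text):
    """Apply strip_with_suffix to every character of text (hypothetical helper)."""
    return "".join(strip_with_suffix(ch) for ch in text)


print(simplify_by_name("Røyksopp"))
print(simplify_by_name("ë ö ø Ł ɲ å ñ"))
```

Because ø is named LATIN SMALL LETTER O WITH STROKE, it maps to o here even though it has no Unicode decomposition.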

Note that Я becomes A, which is the correct behavior. What you want is not how Unicode defines that character; blame KoЯn for abusing it.

Tino Didriksen
  • I just found out: there are two Яs: Я 'CYRILLIC CAPITAL LETTER YA' (U+042F) and ᴙ 'LATIN LETTER SMALL CAPITAL REVERSED R' (U+1D19). Maybe I should include several variants in my search index. – Albert Nov 27 '12 at 15:29
  • FYI: `uconv` is in the `icu4c` Homebrew package but must be manually linked to /usr/local/bin ([source](https://apple.stackexchange.com/questions/201590/uconv-on-mac-os-x-anywhere)) – nloveladyallen Dec 05 '17 at 23:17
  • This wasn't in the original question, but this fails on inputs containing diacritics without letters, as in `´` (acute accent) and `¨` (umlaut/diaeresis) – nloveladyallen Dec 05 '17 at 23:37
  • @nloveladyallen, huh, you're right. Luckily, just add -c to the final command and those go away entirely. Will edit answer... – Tino Didriksen Dec 06 '17 at 21:01

In the Python-specific question, one generic solution was presented that at least removes the diacritics:

  • convert the Unicode string to its decomposed normalization form (NFD), in which base letters and diacritics are separate characters
  • remove all characters whose Unicode category marks them as diacritics (nonspacing marks, category Mn)

This doesn't work for ø, though, because ø has no decomposition into o plus a combining mark.
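
The two steps above can be sketched in Python with the standard `unicodedata` module. This is a minimal illustration rather than the exact code from the linked question; the function name `strip_diacritics` is made up here, and treating category Mn as "diacritic" is an assumption.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose text (NFD) and drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("Röyksopp"))  # ö decomposes into o + combining diaeresis
print(strip_diacritics("Røyksopp"))  # ø has no decomposition and is left as-is
```

The second call shows the limitation: letters such as ø and Ł are atomic in Unicode, so they need either a name-based approach or an explicit mapping table.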

Albert