I'm looking for a way to remove diacritics and other letter marks from text and simplify it so that it is well suited to a text search index.
For removing diacritics, I have already found these:
- questions for PHP: 1, 2
- question for Java: 1, related: 2
- question for Bash: 1
- questions for .NET: 1, 2
- question for JavaScript: 1
- question for Python: 1
I am wondering whether there is a generic, language-independent solution. (This reference list might also be useful to some.)
Removing diacritics works for characters such as äöüò, but I also want the following mappings:
- ø → o
- Я → R
- Ł → L
- ɲ → n
- æ → a (it could also be "ae", but in my case "a" makes more sense because I also want to replace "ae" with "a")
For example, I want to index the name Røyksopp, which sometimes also appears as Röyksopp, under just the simplified name Royksopp. Similarly, KoЯn should become KoRn.
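
To make the desired behaviour concrete, here is a minimal sketch in Python (chosen only for illustration; the question itself is language independent). It assumes two steps: Unicode NFKD decomposition followed by dropping combining marks, which handles letters like ö, plus a small hand-written table for letters such as ø, Ł, ɲ, æ and Я, which have no Unicode decomposition and so survive the first step. The names `EXTRA_FOLDS` and `simplify` are hypothetical, and the table only covers the examples above.

```python
import unicodedata

# Hypothetical, hand-maintained table for letters that have no Unicode
# decomposition, so normalization alone leaves them unchanged.
# It only covers the examples from this question.
EXTRA_FOLDS = {
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "ɲ": "n",
    "æ": "a", "Æ": "A",
    "Я": "R",  # visual lookalike, not a real transliteration
}


def simplify(text: str) -> str:
    """Fold text to a simplified form for use as a search-index key."""
    # Step 1: replace letters listed in the custom table.
    replaced = "".join(EXTRA_FOLDS.get(ch, ch) for ch in text)
    # Step 2: decompose accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Step 3: drop the combining marks, keeping only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for name in ("Røyksopp", "Röyksopp", "KoЯn"):
    print(name, "->", simplify(name))
```

With this sketch, Røyksopp and Röyksopp both fold to Royksopp, and KoЯn folds to KoRn. What I am missing is a generic way to get the contents of such a table instead of maintaining it by hand.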