51

I want to remove all accents and umlauts from a string, turning "lärm" into "larm" and "andré" into "andre". I tried to utf8_decode the string and then run strtr on it, but because my source file is saved as UTF-8, I can't enter the ISO-8859-15 characters for the umlauts: the editor inserts the UTF-8 characters instead.

Obviously a solution for this would be to have an include that's an ISO-8859-15 file, but there must be a better way than to have another required include?

echo strtr(utf8_decode($input), 
           'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');

UPDATE: Maybe I was a bit inaccurate about what I'm trying to do: I do not actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.

BlaM
  • 2
    Keep in mind that the string you produce will not necessarily have the same meaning as the original string, as discussed in this [similar question](http://stackoverflow.com/questions/140422/how-do-i-translate-8bit-characters-into-7bit-characters-ie-220-to-u). It's a serviceable approach for cleaning file names, but probably not something you'd want to do if you are planning to display your new string as text. – Dave DuPlantis Oct 01 '08 at 18:57
  • 2
    Thanks for the hint. However the resulting string will be used as a simplified version fallback for search if "binary search" fails. Even more simplifications will be applied after this one - to allow illiterates to still find what they are looking for :) – BlaM Oct 05 '08 at 00:40
  • 2
    There actually is a valid reason to do it for displayed characters. Generation of HTML 4.1 compliant id attributes for navigation menus. For example, if I have

    Für Elise

    and I want to generate an id anchor above it, is the best I can do and still be compliant with html 4.1 which may be necessary for some older browsers.
    – Alice Wonder Nov 14 '11 at 22:58

8 Answers

59
iconv("utf-8","ascii//TRANSLIT",$input);

Extended example
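A minimal sketch of how the call above might be wrapped defensively. The helper name `to_ascii_fallback` and the list of candidate locale names are assumptions for illustration, not part of PHP or of this answer. What `//TRANSLIT` produces depends on the iconv implementation and the active `LC_CTYPE` locale, so no particular output for accented input is promised here.

```php
<?php
// Hypothetical helper: transliterate to ASCII, falling back to the input.
function to_ascii_fallback(string $input): string
{
    // Read the current character-type locale without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // Try a few common names for a UTF-8 English locale (assumed to exist;
    // setlocale() simply returns false if none of them is installed).
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');

    // iconv() returns false on failure; @ hides the accompanying notice.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);

    // Restore the previous locale so other code is not affected.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $ascii === false ? $input : $ascii;
}

echo to_ascii_fallback('abc'), "\n";
```

Only `LC_CTYPE` is changed, rather than `LC_ALL`, so number and date formatting stay as they were.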

Vinko Vrsalovic
  • 4
    I had to add "setlocale(LC_ALL, 'en_US');" (sadly no locales for Germany seem to be available on my machine :( ), but then it works. Great! :) – BlaM Oct 01 '08 at 15:52
  • 16
    Why does this solution return `"o` for `ö` on my machine and on the examples in the [php reference](http://www.php.net/manual/en/function.iconv.php#105507) it returns `oe`? – spikey May 14 '12 at 12:11
  • 4
    This does not work for Cyrillic characters. They are converted to ? question marks instead. – Zebooka Jul 12 '12 at 17:51
  • 2
    This bombs with a value of false and gives me a notice that illegal characters were encountered... – Matt Apr 25 '13 at 19:32
  • 2
    To spikey's comment: if you set your locale to de_*.UTF8 (de_DE.UTF8, de_CH.UTF8, etc.), then umlauts will be converted to *e (ü->ue). Set it to en_US.UTF8 to get the desired effect. – Michał Leon Dec 19 '13 at 15:46
  • I have the same problem as spikey; the setlocale stuff doesn't help either. – edditor Sep 26 '14 at 08:46
  • setlocale() depends on your operating system, is not thread-safe and wreaks havoc if you do it wrong (such as treating commas as periods in conversions). Either be careful (using LC_CTYPE instead of LC_ALL in this case) or stay away from it unless you know exactly what you're doing. – PeerBr Nov 13 '14 at 12:30
  • Use `"ascii//translit//ignore"` to prevent "illegal characters encountered" error. – Jose Manuel Abarca Rodríguez Mar 21 '19 at 16:41
  • 1
    If `iconv()` with `ASCII//TRANSLIT` doesn't work for you with German umlauts (ä/ö/ü => ae/oe/ue), despite setting `setlocale()` to a German utf8 locale, [this answer to another question](https://stackoverflow.com/questions/50412085/how-to-make-textslug-convert-german-umlauts-properly#answer-50415750) was the solution for me, using [`transliterator_transliterate()`](https://www.php.net/manual/en/transliterator.transliterate.php) with `de-ASCII` supplied via the transliterator build string. – Constantin Groß Aug 02 '21 at 21:06
33

A little trick that doesn't require setting locales or having huge translation tables:

function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }

    return $string;
}

The only requirement for it to work properly is to save your files in UTF-8 (as you should already).
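To make the trick easier to follow, here is the same transformation written out step by step for a single input. This is only an illustrative walkthrough, and the variable names are made up for the example.

```php
<?php
$input = 'andré';

// Step 1: accented letters become named entities (é -> &eacute;).
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

// Step 2: keep only the base letter(s) that precede the accent name.
$stripped = preg_replace(
    '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i',
    '$1',
    $entities
);

// Step 3: decode any entities that were not accent entities (&amp;, &quot;, ...).
$result = html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');

echo $entities, "\n"; // andr&eacute;
echo $result, "\n";   // andre
```

Because the approach relies on the HTML 4.01 named entities, characters without one (for example č) pass through unchanged, and ß comes out as "sz" via `&szlig;`.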

Alix Axel
9

You can also try this:

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

Note that this requires the intl extension: http://php.net/manual/en/book.intl.php

gabo
1

Okay, found an obvious solution myself, but it's not the best concerning performance...

echo strtr(utf8_decode($input), 
           utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
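
A possible alternative, sketched under the assumption that a hand-maintained table is acceptable: the array form of `strtr()` replaces whole byte sequences, so it can operate on the UTF-8 string directly. That avoids `utf8_decode()`, which targets ISO-8859-1 (so characters such as Š or Œ turn into "?") and is deprecated as of PHP 8.2. The table below is a small illustrative subset, and `replace_accents` is a hypothetical name.

```php
<?php
// Illustrative subset only; extend the table with the characters you need.
$map = [
    'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
    'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
    'é' => 'e', 'è' => 'e', 'ß' => 's',
    'Œ' => 'O', 'œ' => 'o',
];

function replace_accents(string $input, array $map): string
{
    // With an array argument, strtr() matches the multibyte keys as
    // byte sequences, so no charset conversion is needed.
    return strtr($input, $map);
}

echo replace_accents('lärm', $map), "\n";  // larm
echo replace_accents('andré', $map), "\n"; // andre
```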
BlaM
  • 2
    It's not the best in terms of performance, and it also produces an incorrect result. Letters like Œ, Æ, etc. should decompose to two letters, not to one. – laurent Dec 23 '12 at 07:41
  • 2
    You have missed `žščřďťňů`, and that's just the ones I see on my keyboard. Whitelisting known characters is not the best solution. – Piskvor left the building Sep 29 '14 at 10:22
  • @this.lau_ As mentioned in the question: I'm looking for the closest "one character ASCII", so no - two letter decomposition would not be correct for my use case. One letter is correct for what I'm looking to do. – BlaM Oct 28 '15 at 12:51
1

If you are using WordPress, you can use the built-in function remove_accents( $string )

https://codex.wordpress.org/Function_Reference/remove_accents

However, I noticed a bug: it doesn't work on a string that consists of a single character.

youtag
  • 1
    Despite not actually being an exact answer, I appreciate this answer as I'm using WordPress. So thanks! ;) – Vladan Feb 17 '22 at 09:19
0

For Arabic and Persian users, I recommend this way of removing diacritics:

    $diacritics = array('َ','ِ','ً','ٌ','ٍ','ّ','ْ','ـ');
    $search_txt = str_replace($diacritics, '', $search_txt); // strip them from the input text, not from the list itself

To type diacritics on an Arabic keyboard in Windows editors, you can either type them directly or hold Alt and type the code of the diacritic character. These are Windows Alt codes (code-page values, not Unicode code points):

ـَ(0243) ـِ(0246) ـُ(0245) ـً(0240) ـٍ(0242) ـٌ(0241) ـْ(0250) ـّ(0248) ـ ـ(0220)

ganji
0

I found that this one gives the most consistent results in French and German. With the meta tag set to utf-8, I have placed it in a function that returns a line from an array of words, and it works perfectly.

htmlentities($line, ENT_SUBSTITUTE, 'utf-8')
kRiZ
jay
  • This will return HTML entities, e.g. München will become M&uuml;nchen. But the requested result should be Muenchen. – kirschkern Mar 10 '23 at 22:06
0

The canonical way to do this:

  1. Obtain the Normalization Form Canonical Decomposition of the text. See https://unicode.org/reports/tr15/ for Unicode Normalization Forms.
  2. Remove nonspacing marks.
  3. Obtain the Normalization Form Canonical Composition of the remaining text.

https://unicode-org.github.io/icu/userguide/transforms/general/

For example, to remove accents from characters, use the following transform:

NFD; [:Nonspacing Mark:] Remove; NFC.

I am a bit unsure why they have given this example as such when the page also notes

each transform rule consists of two colons followed by a transform name.

So we will add those. You need the intl extension which wraps the ICU library.

$t = \Transliterator::createFromRules(':: NFD; ::[:Nonspacing Mark:] Remove; :: NFC;');

Example

print $t->transliterate('أ');

This transforms U+0623 (Arabic Letter Alef with Hamza Above) to U+0627 (Arabic Letter Alef), i.e. it works with non-Latin letters and their accents as well.

You can replace [:Nonspacing Mark:] with [:Mn:].
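
The three steps listed at the top of this answer can also be sketched without a rule string, using the intl extension's `Normalizer` class and a Unicode-aware regex. This is an illustrative variant, not the ICU transform itself, and `strip_marks` is a hypothetical name.

```php
<?php
// Requires the intl extension for the Normalizer class.
function strip_marks(string $text): string
{
    // 1. Canonical decomposition: é -> e + U+0301 (combining acute).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // input was not valid UTF-8
    }

    // 2. Remove nonspacing marks (Unicode general category Mn).
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
    if ($stripped === null) {
        return $text;
    }

    // 3. Canonical composition of whatever remains.
    $recomposed = Normalizer::normalize($stripped, Normalizer::FORM_C);
    return $recomposed === false ? $stripped : $recomposed;
}

echo strip_marks('andré'), "\n"; // andre
echo strip_marks('lärm'), "\n";  // larm
```

Letters with no canonical decomposition, such as ø, ß or æ, are left as they are by this approach.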

chx