PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string

Question

What I want to do is remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". I tried to utf8_decode the string and then use strtr on it, but since my source file is saved as a UTF-8 file, I can't enter the ISO-8859-15 characters for all umlauts: the editor inserts the UTF-8 characters.

Obviously a solution for this would be to have an include that's an ISO-8859-15 file, but there must be a better way than to have another required include?

echo strtr(utf8_decode($input), 
           'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');

UPDATE: Maybe I was a bit inaccurate about what I'm trying to do: I don't actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.

Keep in mind that the string you produce will not necessarily have the same meaning as the original string, as discussed in this [similar question](http://stackoverflow.com/questions/140422/how-do-i-translate-8bit-characters-into-7bit-characters-ie-220-to-u). It's a serviceable approach for cleaning file names, but probably not something you'd want to do if you are planning to display your new string as text. — Dave DuPlantis, Oct 01 '08 at 18:57
Thanks for the hint. However the resulting string will be used as a simplified version fallback for search if "binary search" fails. Even more simplifications will be applied after this one - to allow illiterates to still find what they are looking for :) — BlaM, Oct 05 '08 at 00:40
There actually is a valid reason to do it for displayed characters: generating HTML 4.01 compliant id attributes for navigation menus. For example, if I have a "Für Elise" heading and want to generate an id anchor above it, an ASCII-only id is the best I can do while staying compliant with HTML 4.01, which may be necessary for some older browsers. – Alice Wonder, Nov 14 '11 at 22:58

score 59 · Accepted Answer · answered Oct 01 '08 at 15:38

iconv("utf-8","ascii//TRANSLIT",$input);

answered Oct 01 '08 at 15:38

Vinko Vrsalovic

I had to add "setlocale(LC_ALL, 'en_US');" (sadly no locales for Germany seem to be available on my machine :( ), but then it works. Great! :) – BlaM Oct 01 '08 at 15:52
Why does this solution return `"o` for `ö` on my machine and on the examples in the [php reference](http://www.php.net/manual/en/function.iconv.php#105507) it returns `oe`? – spikey May 14 '12 at 12:11
This does not work for Cyrillic characters. They are converted to ? question marks instead. – Zebooka Jul 12 '12 at 17:51
This bombs with a value of false and gives me a notice that illegal characters were encountered... – Matt Apr 25 '13 at 19:32
To spikey's comment: if you set your locale to de_*.UTF8 (de_DE.UTF8, de_CH.UTF8, etc.), then umlauts will be converted to *e (ü->ue). Set it to en_US.UTF8 to get the desired effect. – Michał Leon Dec 19 '13 at 15:46
I have the same problem as spikey, and the setlocale stuff doesn't help either. – edditor Sep 26 '14 at 08:46
setlocale() depends on your operating system, is not thread-safe and wreaks havoc if you do it wrong (such as treating commas as periods in conversions). Either be careful (using LC_CTYPE instead of LC_ALL in this case) or stay away from it unless you know exactly what you're doing. – PeerBr Nov 13 '14 at 12:30
Use `"ascii//translit//ignore"` to prevent "illegal characters encountered" error. – Jose Manuel Abarca Rodríguez Mar 21 '19 at 16:41
If `iconv()` with `ASCII//TRANSLIT` doesn't work for you with German umlauts (ä/ö/ü => ae/oe/ue), despite setting `setlocale()` to a German UTF-8 locale, [this answer to another question](https://stackoverflow.com/questions/50412085/how-to-make-textslug-convert-german-umlauts-properly#answer-50415750) was the solution for me: it uses [`transliterator_transliterate()`](https://www.php.net/manual/en/transliterator.transliterate.php) with `de-ASCII` supplied via the transliterator build string. – Constantin Groß Aug 02 '21 at 21:06
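
Pulling the comments above together, a minimal sketch might look like the following. It assumes the `iconv` extension is loaded and that at least one of the listed UTF-8 locales exists on the machine; the helper name `to_ascii()` is made up for illustration. It changes only `LC_CTYPE`, restores the previous locale afterwards, and returns the input unchanged if the conversion fails.

```php
<?php
// Hypothetical helper; the name to_ascii() does not come from the thread.
function to_ascii(string $input): string
{
    // Remember the current LC_CTYPE ("0" only queries it), then switch
    // that single category instead of LC_ALL.
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');

    // //TRANSLIT asks for look-alike replacements, //IGNORE asks iconv to
    // skip what it cannot convert; @ silences the notice on failure.
    $result = function_exists('iconv')
        ? @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input)
        : false;

    // Put the previous locale back so the rest of the script is unaffected.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    // Fall back to the untouched input when the conversion fails outright.
    return $result === false ? $input : $result;
}

echo to_ascii('lärm'), "\n"; // exact output depends on the iconv build and locale
```

What comes back for `ö` (`o`, `"o`, `oe` or `?`) still depends on the iconv implementation and the active locale, as the comments above show.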

score 33 · Answer 2 · answered May 10 '11 at 13:14

A little trick that doesn't require setting locales or having huge translation tables:

function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }

    return $string;
}

The only requirement for it to work properly is to save your files in UTF-8 (as you should already).
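
To see why this works, here is a small trace of the same three steps with the intermediate strings exposed. The helper name `unaccent_steps()` is made up; the pattern is copied from the function above.

```php
<?php
// Hypothetical helper that exposes the intermediate strings of Unaccent().
function unaccent_steps(string $input): array
{
    // 1. Accented letters become named entities, e.g. é => &eacute;
    $entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

    // 2. Keep the leading letter(s) of each entity name, drop the accent suffix.
    $stripped = preg_replace(
        '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i',
        '$1',
        $entities
    );

    // 3. Decode whatever entities remain (quotes, &amp;, names the pattern skipped).
    $output = html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');

    return [$entities, $stripped, $output];
}

[$entities, , $output] = unaccent_steps('andré');
echo $entities, "\n"; // andr&eacute;
echo $output, "\n";   // andre
```

Because the capture group allows two letters, ligature-style entities expand to two characters: `ß` (`&szlig;`) comes out as `sz` and `æ` (`&aelig;`) as `ae`. Letters whose accent is not in the suffix list, such as `š`, pass through unchanged.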

Works great for Hungarian. – vinczemarton, Nov 04 '16 at 12:27

score 9 · Answer 3 · answered Feb 03 '16 at 13:12

You can also try this:

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

but you need the intl extension (http://php.net/manual/en/book.intl.php) to be available.

score 1 · Answer 4 · answered Oct 01 '08 at 15:33

Okay, found an obvious solution myself, but it's not the best concerning performance...

echo strtr(utf8_decode($input), 
           utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
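
One caveat with the snippet above: `utf8_decode()` converts to ISO-8859-1, so characters outside that set (such as Š, Œ or Ÿ) turn into `?` before `strtr()` sees them, and the function is deprecated as of PHP 8.2. A possible alternative, sketched here with a deliberately shortened table and a made-up function name, is the array form of `strtr()`, which matches whole UTF-8 byte sequences and needs no decoding step.

```php
<?php
// Hypothetical helper; the table is an illustrative subset, not the full list.
function fold_to_ascii(string $input): string
{
    static $map = [
        'Š' => 'S', 'š' => 's', 'Œ' => 'O', 'œ' => 'o', 'Ÿ' => 'Y',
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U', 'ß' => 's',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
        'à' => 'a', 'á' => 'a', 'â' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ç' => 'c', 'ñ' => 'n', 'ø' => 'o', 'æ' => 'a',
    ];

    // With an array, strtr() replaces whole keys, so multi-byte UTF-8
    // characters are safe and unrelated bytes such as "?" are left alone.
    return strtr($input, $map);
}

echo fold_to_ascii('lärm'), "\n";  // larm
echo fold_to_ascii('andré'), "\n"; // andre
```

This keeps the "one character in, one character out" behaviour asked for in the question, but it still only covers the characters you list.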

edited Oct 01 '08 at 15:55

answered Oct 01 '08 at 15:33

BlaM

It's not the best in terms of performance and it also produces an incorrect result. Letters like Œ, Æ, etc. should decompose to two letters, not to one. – laurent Dec 23 '12 at 07:41
You have missed `žščřďťňů`, and that's just the ones I see on my keyboard. Whitelisting known characters is not the best solution. – Piskvor left the building Sep 29 '14 at 10:22
@this.lau_ As mentioned in the question: I'm looking for the closest "one character ASCII", so no - two letter decomposition would not be correct for my use case. One letter is correct for what I'm looking to do. – BlaM Oct 28 '15 at 12:51

score 1 · Answer 5 · answered Jun 01 '18 at 14:15

If you are using WordPress, you can use the built-in function remove_accents( $string )

https://codex.wordpress.org/Function_Reference/remove_accents

However, I noticed a bug: it doesn't work on a string with a single character.

answered Jun 01 '18 at 14:15

youtag

Despite not actually being an exact answer, I appreciate this answer as I'm using WordPress. So thanks! ;) – Vladan Feb 17 '22 at 09:19

score 0 · Answer 6 · answered Nov 08 '14 at 11:55

For Arabic and Persian users, I recommend this way to remove diacritics:

    $diacritics = array('َ','ِ','ً','ٌ','ٍ','ّ','ْ','ـ');
    // $search_txt holds the text to clean; it (not $diacritics) must be the subject
    $search_txt = str_replace($diacritics, '', $search_txt);

To type these diacritics with an Arabic keyboard layout in Windows editors, either type them directly or hold Alt and type the character's code. These are Windows Alt codes (Windows-1256 code page values), not Unicode code points:

ـَ(0243) ـِ(0246) ـُ(0245) ـً(0240) ـٍ(0242) ـٌ(0241) ـْ(0250) ـّ(0248) ـ ـ(0220)
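
As a variation on the same idea, the marks can be matched by Unicode range instead of being typed out one by one. This is only a sketch: the function name is made up, and the range U+064B to U+0652 (fathatan through sukun) plus U+0640 (tatweel) is my choice of what to strip.

```php
<?php
// Hypothetical helper; the character range is an assumption, adjust as needed.
function strip_arabic_marks(string $text): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/[\x{064B}-\x{0652}\x{0640}]/u', '', $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input).
    return $result ?? $text;
}

// "Muhammad" written with vowel marks and a shadda.
$marked = "\u{0645}\u{064F}\u{062D}\u{064E}\u{0645}\u{0651}\u{064E}\u{062F}";
echo strip_arabic_marks($marked), "\n";
```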

score 0 · Answer 7 · edited Aug 24 '16 at 02:23

I found that this one gives the most consistent results in French and German. With the meta tag set to utf-8, I placed it in a function that returns a line from an array of words, and it works perfectly.

htmlentities (  $line, ENT_SUBSTITUTE   , 'utf-8' )

edited Aug 24 '16 at 02:23

kRiZ

answered Aug 24 '16 at 00:18

jay

This will return HTML entities, e.g. München will become M&uuml;nchen. But the requested result should be Muenchen. – kirschkern Mar 10 '23 at 22:06

score 0 · Answer 8 · answered Jun 20 '23 at 16:24

The canonical way to do this:

1. Obtain the Normalization Form Canonical Decomposition of the text. See https://unicode.org/reports/tr15/ for Unicode Normalization Forms.
2. Remove nonspacing marks.
3. Obtain the Normalization Form Canonical Composition of the remaining text.

https://unicode-org.github.io/icu/userguide/transforms/general/

For example, to remove accents from characters, use the following transform:

NFD; [:Nonspacing Mark:] Remove; NFC.

I am a bit unsure why they have given this example as such when the page also notes

each transform rule consists of two colons followed by a transform name.

So we will add those. You need the intl extension which wraps the ICU library.

$t = \Transliterator::createFromRules(':: NFD; ::[:Nonspacing Mark:] Remove; :: NFC;');

Example

print $t->transliterate('أ');

This transforms U+0623 (Arabic Letter Alef with Hamza Above) to U+0627 (Arabic Letter Alef), i.e. it works with non-Latin letters and their accents as well.

You can replace [:Nonspacing Mark:] with [:Mn:].
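
Wrapped up as a function, the three steps could look like this. It is only a sketch: the name `strip_marks()` is made up, `[:Mn:]` is used as the short form mentioned above, and the input is returned untouched when intl is not loaded or the transform fails.

```php
<?php
// Hypothetical wrapper around the NFD / remove marks / NFC rule set.
function strip_marks(string $text): string
{
    // Without the intl extension there is nothing to do.
    if (!class_exists('Transliterator')) {
        return $text;
    }

    // Decompose, drop nonspacing marks, recompose.
    $t = \Transliterator::createFromRules(':: NFD; :: [:Mn:] Remove; :: NFC;');
    if ($t === null) {
        return $text; // the rules could not be compiled
    }

    $result = $t->transliterate($text);

    return $result === false ? $text : $result;
}

echo strip_marks('lärm'), "\n";
echo strip_marks("\u{0623}"), "\n"; // Alef with Hamza Above
```

Letters that are not a base letter plus a combining mark, such as `ø` or `ß`, have no canonical decomposition and are left alone by this rule set; they need something like the `Latin-ASCII` transform shown in the earlier answer.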
