
I am trying to replace accented characters with the normal replacements. Below is what I am currently doing.

    $string = "Éric Cantona";
    $strict = strtolower($string);

    echo "After Lower: ".$strict;

    $patterns[0] = '/[á|â|à|å|ä]/';
    $patterns[1] = '/[ð|é|ê|è|ë]/';
    $patterns[2] = '/[í|î|ì|ï]/';
    $patterns[3] = '/[ó|ô|ò|ø|õ|ö]/';
    $patterns[4] = '/[ú|û|ù|ü]/';
    $patterns[5] = '/æ/';
    $patterns[6] = '/ç/';
    $patterns[7] = '/ß/';
    $replacements[0] = 'a';
    $replacements[1] = 'e';
    $replacements[2] = 'i';
    $replacements[3] = 'o';
    $replacements[4] = 'u';
    $replacements[5] = 'ae';
    $replacements[6] = 'c';
    $replacements[7] = 'ss';

    $strict = preg_replace($patterns, $replacements, $strict);
    echo "Final: ".$strict;

This gives me:

    After Lower: éric cantona
    Final: ric cantona

The above gives me "ric cantona", but I want the output to be "eric cantona".

Can anyone help me see where I am going wrong?

Lizard
  • For what it's worth, I copied and pasted, and ran this verbatim and got "eric cantona" (using php 5.2.9-4) – Brandon Horsley Jul 30 '10 at 13:10
  • @brandon it will depend on the encoding that you save the file in. I assume that lizard saved it as utf-8, and you saved it as iso-8859-1. – troelskn Jul 30 '10 at 13:12
  • What version of php are you using? – Brandon Horsley Jul 30 '10 at 13:14
  • possible duplicate of [Problem with function removing accents and other characters in PHP](http://stackoverflow.com/questions/606631/problem-with-function-removing-accents-and-other-characters-in-php) – outis Jul 22 '12 at 08:29
  • You could try this package: https://github.com/rap2hpoutre/convert-accent-characters – rap-2-h Dec 27 '17 at 15:55

20 Answers


I have tried all sorts based on the variations listed in the answers, but the following worked:

$unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );
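
As a usage illustration, here is a minimal sketch of the same strtr() approach with a much smaller table. The function name strip_accents_strtr is hypothetical, the table is only a subset chosen for the example, and it assumes both the source file and the input are UTF-8. It maps ß to 'ss' and æ to 'ae', as the comments on this answer suggest.

```php
<?php
// Hypothetical helper; the table is a small subset for illustration only.
function strip_accents_strtr(string $str): string
{
    $map = [
        'É' => 'E', 'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ñ' => 'n', 'ç' => 'c',
        'ß' => 'ss', // multi-character replacement, as suggested in the comments
        'æ' => 'ae',
    ];
    // With an array argument, strtr() replaces whole substrings and never
    // rescans text it has already replaced.
    return strtr($str, $map);
}

echo strip_accents_strtr('Éric Cantona'), "\n"; // Eric Cantona
```

Because strtr() works on bytes, this only behaves as intended when the keys and the input share the same encoding.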
Getz
Lizard
  • Add these for Turkish support: `'Ğ'=>'G', 'İ'=>'I', 'Ş'=>'S', 'ğ'=>'g', 'ı'=>'i', 'ş'=>'s', 'ü'=>'u',` – Halil Özgür Mar 20 '12 at 10:18
  • Add these for Romanian support: 'ă'=>'a', 'Ă'=>'A', 'ș'=>'s', 'Ș'=>'S', 'ț'=>'t', 'Ț'=>'T' – Vlad Mar 24 '13 at 21:52
  • There is a minor error: 'ß' cannot be translated to 'Ss' but must be replaced with 'ss'. This German-only character is never used in an uppercase scope. – KTB Jan 22 '14 at 10:18
  • I think Germans prefer to translate 'Ä'=>'AE', instead of 'Ä'=>'A'. I read somewhere that if they cannot type the two dots (like on credit cards) they put "E" after the letter, instead of simply removing the dots. So Jäger would actually become Jaeger, instead of Jager. – Pringles Jun 23 '14 at 10:16
  • Since a lot of people have upvoted this answer, it needs to be said that the safer way is to use chr() instead of hard-coded accented characters, because the file may be opened with different editors. – Mladen B. Sep 17 '14 at 04:53
  • is this possible to do using regex ? – Hitesh Jul 18 '16 at 08:17
  • def best answer, the rest may or may not require encoding/decoding and some depend on your version of PHP – Robert Sinclair Nov 10 '16 at 19:40
  • Not the most elegant solution, but a simple solution that works. Thanks ! – CTala May 01 '17 at 19:13
  • Isn't it better to use `iconv`? – Rodrigo Aug 08 '19 at 00:35
  • and what is $str? – f7n Dec 11 '19 at 11:01
  • Add these for Hungarian support: `'ű'=>'u', 'Ű'=>'U', 'ő'=>'o', 'Ő'=>'O', 'ü'=>'u'` – Sirsemy Dec 16 '21 at 13:42

To remove the diacritics, use iconv:

$val = iconv('ISO-8859-1','ASCII//TRANSLIT',$val);

or

$val = iconv('UTF-8','ASCII//TRANSLIT',$val);

Note that PHP (sometimes?) needs a locale to be set with setlocale() for these conversions to work.

Edit: tested, and it handles all of your diacritics out of the box:

$val = "á|â|à|å|ä ð|é|ê|è|ë í|î|ì|ï ó|ô|ò|ø|õ|ö ú|û|ù|ü æ ç ß abc ABC 123";
echo iconv('UTF-8','ASCII//TRANSLIT',$val); 

Output (updated 2019-12-30):

a|a|a|a|a d|e|e|e|e i|i|i|i o|o|o|o|o|o u|u|u|u ae c ss abc ABC 123

Note that ð is correctly transliterated to d, not to o as it is in the accepted answer.
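
The locale caveat above can be wrapped in a small helper. This is only a sketch: the function name to_ascii_iconv and the list of locale names are assumptions, and the exact output of //TRANSLIT depends on the iconv library and the locales installed on the system, so no output is shown for accented input.

```php
<?php
// Hypothetical helper; what //TRANSLIT produces depends on the platform.
function to_ascii_iconv(string $s): string
{
    if (!function_exists('iconv')) {
        return $s; // assumption: leave the input untouched without ext-iconv
    }
    // Remember the current LC_CTYPE so the change stays local to this call.
    $previous = setlocale(LC_CTYPE, '0');
    // Try a few common UTF-8 locale names; which ones exist is system-specific.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');

    // @ silences the notice iconv raises on characters it cannot convert.
    $out = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }
    // iconv() returns false on failure; fall back to the original string.
    return $out === false ? $s : $out;
}

echo to_ascii_iconv('abc ABC 123'), "\n";
```

Restoring the previous locale avoids changing the behaviour of other locale-sensitive functions later in the request.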

mvds
  • Worth noting that `iconv` will error and cut the string off at 'illegal characters'. To solve this, you can use `iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $val)` – Rowan Mar 10 '11 at 15:17
  • Didn't work here. With `iconv('ISO-8859-1', 'ASCII//TRANSLIT', $val)`, `áêìõç` became `'a^e\`i~oc`. – Rafael Barros Sep 02 '14 at 19:44
  • I don't think these things are entirely related to PHP alone. Could they also depend on the locales and/or particular version of the iconv library installed? – mvds Sep 11 '14 at 10:20
  • His answer seems to me the best, maybe "merge" your suggestion to `$c = mb_detect_encoding($text, mb_detect_order(), true); $val = iconv($c, 'ASCII//TRANSLIT',$val);` is a good way? :) Thanks +1 – Protomen Apr 28 '15 at 05:25
  • This fixed the question marks and quotes for me `setlocale(LC_ALL, "en_US.utf8"); $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);` – Josh Bernfeld Jun 25 '16 at 23:30
  • above PHP example gives me `?|?|?|?|? ?|?|?|?|? ?|?|?|? ?|?|?|?|?|? ?|?|?|? ae ? ss abc ABC 123 ` – Umair Ayub Jan 24 '20 at 10:34
  • This doesn't work for me and just removes accented letters – Tom Mar 23 '20 at 15:42
  • @mvds Setting locale to "en_US.utf8" helps but it's not in this answer. This is my answer: https://stackoverflow.com/a/60816979/1404447 – Tom Mar 23 '20 at 15:53
  • for some reason it is adding a double quote in the replaced accent char `ë` => `"e` – albanx Apr 13 '22 at 11:27
  • tried this and almost none of the replaced characters worked, my output was: ```'a|^a|`a|a|"a d|'e|^e|`e|"e 'i|^i|`i|"i 'o|^o|`o|o|~o|"o 'u|^u|`u|"u ae c ss abc ABC 123``` – Nathan G-T Jul 20 '22 at 20:51

I just came across the answer from Lizard, which is extremely helpful, especially when you do some sorting. Isn't it beautiful how many characters we need to say mostly the same thing ;)

If anyone else is looking for an all-in-one solution (as far as the comments above tell), here's the copy & paste:

/**
 * Replace language-specific characters by ASCII-equivalents.
 * @param string $s
 * @return string
 */
public static function normalizeChars($s) {
    $replace = array(
        'ъ'=>'-', 'Ь'=>'-', 'Ъ'=>'-', 'ь'=>'-',
        'Ă'=>'A', 'Ą'=>'A', 'À'=>'A', 'Ã'=>'A', 'Á'=>'A', 'Æ'=>'A', 'Â'=>'A', 'Å'=>'A', 'Ä'=>'Ae',
        'Þ'=>'B',
        'Ć'=>'C', 'ץ'=>'C', 'Ç'=>'C',
        'È'=>'E', 'Ę'=>'E', 'É'=>'E', 'Ë'=>'E', 'Ê'=>'E',
        'Ğ'=>'G',
        'İ'=>'I', 'Ï'=>'I', 'Î'=>'I', 'Í'=>'I', 'Ì'=>'I',
        'Ł'=>'L',
        'Ñ'=>'N', 'Ń'=>'N',
        'Ø'=>'O', 'Ó'=>'O', 'Ò'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'Oe',
        'Ş'=>'S', 'Ś'=>'S', 'Ș'=>'S', 'Š'=>'S',
        'Ț'=>'T',
        'Ù'=>'U', 'Û'=>'U', 'Ú'=>'U', 'Ü'=>'Ue',
        'Ý'=>'Y',
        'Ź'=>'Z', 'Ž'=>'Z', 'Ż'=>'Z',
        'â'=>'a', 'ǎ'=>'a', 'ą'=>'a', 'á'=>'a', 'ă'=>'a', 'ã'=>'a', 'Ǎ'=>'a', 'а'=>'a', 'А'=>'a', 'å'=>'a', 'à'=>'a', 'א'=>'a', 'Ǻ'=>'a', 'Ā'=>'a', 'ǻ'=>'a', 'ā'=>'a', 'ä'=>'ae', 'æ'=>'ae', 'Ǽ'=>'ae', 'ǽ'=>'ae',
        'б'=>'b', 'ב'=>'b', 'Б'=>'b', 'þ'=>'b',
        'ĉ'=>'c', 'Ĉ'=>'c', 'Ċ'=>'c', 'ć'=>'c', 'ç'=>'c', 'ц'=>'c', 'צ'=>'c', 'ċ'=>'c', 'Ц'=>'c', 'Č'=>'c', 'č'=>'c', 'Ч'=>'ch', 'ч'=>'ch',
        'ד'=>'d', 'ď'=>'d', 'Đ'=>'d', 'Ď'=>'d', 'đ'=>'d', 'д'=>'d', 'Д'=>'D', 'ð'=>'d',
        'є'=>'e', 'ע'=>'e', 'е'=>'e', 'Е'=>'e', 'Ə'=>'e', 'ę'=>'e', 'ĕ'=>'e', 'ē'=>'e', 'Ē'=>'e', 'Ė'=>'e', 'ė'=>'e', 'ě'=>'e', 'Ě'=>'e', 'Є'=>'e', 'Ĕ'=>'e', 'ê'=>'e', 'ə'=>'e', 'è'=>'e', 'ë'=>'e', 'é'=>'e',
        'ф'=>'f', 'ƒ'=>'f', 'Ф'=>'f',
        'ġ'=>'g', 'Ģ'=>'g', 'Ġ'=>'g', 'Ĝ'=>'g', 'Г'=>'g', 'г'=>'g', 'ĝ'=>'g', 'ğ'=>'g', 'ג'=>'g', 'Ґ'=>'g', 'ґ'=>'g', 'ģ'=>'g',
        'ח'=>'h', 'ħ'=>'h', 'Х'=>'h', 'Ħ'=>'h', 'Ĥ'=>'h', 'ĥ'=>'h', 'х'=>'h', 'ה'=>'h',
        'î'=>'i', 'ï'=>'i', 'í'=>'i', 'ì'=>'i', 'į'=>'i', 'ĭ'=>'i', 'ı'=>'i', 'Ĭ'=>'i', 'И'=>'i', 'ĩ'=>'i', 'ǐ'=>'i', 'Ĩ'=>'i', 'Ǐ'=>'i', 'и'=>'i', 'Į'=>'i', 'י'=>'i', 'Ї'=>'i', 'Ī'=>'i', 'І'=>'i', 'ї'=>'i', 'і'=>'i', 'ī'=>'i', 'ij'=>'ij', 'IJ'=>'ij',
        'й'=>'j', 'Й'=>'j', 'Ĵ'=>'j', 'ĵ'=>'j', 'я'=>'ja', 'Я'=>'ja', 'Э'=>'je', 'э'=>'je', 'ё'=>'jo', 'Ё'=>'jo', 'ю'=>'ju', 'Ю'=>'ju',
        'ĸ'=>'k', 'כ'=>'k', 'Ķ'=>'k', 'К'=>'k', 'к'=>'k', 'ķ'=>'k', 'ך'=>'k',
        'Ŀ'=>'l', 'ŀ'=>'l', 'Л'=>'l', 'ł'=>'l', 'ļ'=>'l', 'ĺ'=>'l', 'Ĺ'=>'l', 'Ļ'=>'l', 'л'=>'l', 'Ľ'=>'l', 'ľ'=>'l', 'ל'=>'l',
        'מ'=>'m', 'М'=>'m', 'ם'=>'m', 'м'=>'m',
        'ñ'=>'n', 'н'=>'n', 'Ņ'=>'n', 'ן'=>'n', 'ŋ'=>'n', 'נ'=>'n', 'Н'=>'n', 'ń'=>'n', 'Ŋ'=>'n', 'ņ'=>'n', 'ʼn'=>'n', 'Ň'=>'n', 'ň'=>'n',
        'о'=>'o', 'О'=>'o', 'ő'=>'o', 'õ'=>'o', 'ô'=>'o', 'Ő'=>'o', 'ŏ'=>'o', 'Ŏ'=>'o', 'Ō'=>'o', 'ō'=>'o', 'ø'=>'o', 'ǿ'=>'o', 'ǒ'=>'o', 'ò'=>'o', 'Ǿ'=>'o', 'Ǒ'=>'o', 'ơ'=>'o', 'ó'=>'o', 'Ơ'=>'o', 'œ'=>'oe', 'Œ'=>'oe', 'ö'=>'oe',
        'פ'=>'p', 'ף'=>'p', 'п'=>'p', 'П'=>'p',
        'ק'=>'q',
        'ŕ'=>'r', 'ř'=>'r', 'Ř'=>'r', 'ŗ'=>'r', 'Ŗ'=>'r', 'ר'=>'r', 'Ŕ'=>'r', 'Р'=>'r', 'р'=>'r',
        'ș'=>'s', 'с'=>'s', 'Ŝ'=>'s', 'š'=>'s', 'ś'=>'s', 'ס'=>'s', 'ş'=>'s', 'С'=>'s', 'ŝ'=>'s', 'Щ'=>'sch', 'щ'=>'sch', 'ш'=>'sh', 'Ш'=>'sh', 'ß'=>'ss',
        'т'=>'t', 'ט'=>'t', 'ŧ'=>'t', 'ת'=>'t', 'ť'=>'t', 'ţ'=>'t', 'Ţ'=>'t', 'Т'=>'t', 'ț'=>'t', 'Ŧ'=>'t', 'Ť'=>'t', '™'=>'tm',
        'ū'=>'u', 'у'=>'u', 'Ũ'=>'u', 'ũ'=>'u', 'Ư'=>'u', 'ư'=>'u', 'Ū'=>'u', 'Ǔ'=>'u', 'ų'=>'u', 'Ų'=>'u', 'ŭ'=>'u', 'Ŭ'=>'u', 'Ů'=>'u', 'ů'=>'u', 'ű'=>'u', 'Ű'=>'u', 'Ǖ'=>'u', 'ǔ'=>'u', 'Ǜ'=>'u', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'У'=>'u', 'ǚ'=>'u', 'ǜ'=>'u', 'Ǚ'=>'u', 'Ǘ'=>'u', 'ǖ'=>'u', 'ǘ'=>'u', 'ü'=>'ue',
        'в'=>'v', 'ו'=>'v', 'В'=>'v',
        'ש'=>'w', 'ŵ'=>'w', 'Ŵ'=>'w',
        'ы'=>'y', 'ŷ'=>'y', 'ý'=>'y', 'ÿ'=>'y', 'Ÿ'=>'y', 'Ŷ'=>'y',
        'Ы'=>'y', 'ž'=>'z', 'З'=>'z', 'з'=>'z', 'ź'=>'z', 'ז'=>'z', 'ż'=>'z', 'ſ'=>'z', 'Ж'=>'zh', 'ж'=>'zh'
    );
    return strtr($s, $replace);
}

Note some slight changes regarding the German umlauts (ä => ae)

Edit: Included more characters based on the posting from user3682119 (except for the copyright symbol) and the comment from daker.

BurninLeo
  • Thanks for updating the list from @Lizard. Still missing some chars though, at least the Polish ones: `'Ą' => 'A', 'ą' => 'a', 'Ć' => 'C', 'ć' => 'c', 'Ę' => 'E', 'ę' => 'e', 'Ł' => 'L', 'ł' => 'l', 'Ń' => 'N', 'ń' => 'n', 'Ś' => 'S', 'ś' => 's', 'Ż' => 'Z', 'ż' => 'z', 'Ź' => 'Z', 'ź' => 'z'` – kasimir Jun 25 '14 at 08:22
  • This is awesome; however, the lowercase replacements are mixed with uppercase source characters, unlike the uppercase ones, e.g. d => д and d => Д. This is wrong; only D => Д should be in this table, I think, right? – Kwaadpepper Nov 22 '15 at 12:12
  • Just to mention an idea: this also allowed me to build regex matching regardless of special chars :p ```rss => '[rŕřŘŗŖרŔРр](?:[sșсŜšśסşСŝ][sșсŜšśסşСŝ]|[ß])'``` – Kwaadpepper Nov 22 '15 at 12:15
  • Here is a script cleaning up this answer. http://paste.debian.net/334940/ And the full cleaned result ready to work with : http://paste.debian.net/334948/ Note that double and triple letter index are only present on lower case to avoid multiple combination so they include lower and upper case chars – Kwaadpepper Nov 22 '15 at 14:11
  • Today I got this issue but this answer wasn't enough because my string had an accent in another character. So I had for example a simple 'o' and then 2 strange characters. I url encoded them and those are : "%CC%81". So I added `urldecode('%CC%81') => '',` to the `$replace` array and fixed my problem. – Javier Enríquez Mar 25 '16 at 03:12
  • I assume, that is the UTF-8 character ́ (COMBINING ACUTE ACCENT, see http://www.utf8-chartable.de/unicode-utf8-table.pl?start=768&number=128) that has something like a negative margin on the left to be placed above the previous character - like this: x́ (this is an X with the character behind!) Interesting stuff :) UTF-8 knows a lot such characters - therefore, it may be sensible to `preg_replace('/[^a-z0-9 ]/i', '', $s)` after doing the above replacements. – BurninLeo Mar 30 '16 at 12:40

Since PHP 5.4, the intl extension provides a class named Transliterator.

I believe that's the best way to remove diacritics for two reasons:

  1. Transliterator is based on ICU, so you're using the tables of the ICU library. ICU is a great project, developed over the years to provide comprehensive tables and functionality. Whatever table you write yourself will never be as complete as the one from ICU.

  2. In UTF-8, the same character can be represented in different ways. For example, ñ can be stored as a single (multi-byte) code point, or as the letter n followed by a combining tilde (itself multi-byte). In addition, some Unicode characters are homographs: they look the same while having different code points. For this reason it's also important to normalize the string.
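
To make the second point concrete, here is a small sketch that needs no extension. It assumes PHP 7.0 or later (for the "\u{...}" escapes); the variable names are only illustrative. It shows that the two spellings of ñ are different byte sequences, that a lookup table keyed on the precomposed form misses the decomposed one, and that stripping combining marks handles the decomposed form.

```php
<?php
$precomposed = "\u{00F1}";  // ñ as a single code point (U+00F1)
$decomposed  = "n\u{0303}"; // n followed by COMBINING TILDE (U+0303)

var_dump($precomposed === $decomposed); // bool(false)
echo bin2hex($precomposed), "\n";       // c3b1
echo bin2hex($decomposed), "\n";        // 6ecc83

// A table keyed on the precomposed character leaves the decomposed form alone.
$table    = ["\u{00F1}" => 'n'];
$viaTable = strtr($decomposed, $table);

// Removing nonspacing marks (Unicode category Mn) strips the combining tilde.
$viaRegex = preg_replace('/\p{Mn}+/u', '', $decomposed);

var_dump($viaTable === $decomposed); // bool(true): nothing was replaced
var_dump($viaRegex);                 // string(1) "n"
```

This is why a normalization step belongs in front of any table-based or mark-stripping approach.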

Here's some sample code, taken from an old answer of mine:

<?php
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;', Transliterator::FORWARD);
$test = ['abcd', 'èe', '€', 'àòùìéëü', 'àòùìéëü', 'tiësto'];
foreach($test as $e) {
    $normalized = $transliterator->transliterate($e);
    echo $e. ' --> '.$normalized."\n";
}
?>

Result:

abcd --> abcd
èe --> ee
€ --> €
àòùìéëü --> aouieeu
àòùìéëü --> aouieeu
tiësto --> tiesto

The first argument to Transliterator::createFromRules() is the rule string; here it removes the diacritics and also normalizes the string.
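
The rules above only remove combining marks, so letters such as æ are left as they are. A possible variant, sketched here, uses the compound ID 'Any-Latin; Latin-ASCII' through transliterator_transliterate(). The function name to_ascii_icu and the fallback behaviour are assumptions; the result depends on the ICU version that the intl extension is built against, so no output is shown.

```php
<?php
// Hypothetical helper around ext-intl; returns the input unchanged if the
// extension is missing or the transliteration fails.
function to_ascii_icu(string $s): string
{
    if (!function_exists('transliterator_transliterate')) {
        return $s; // assumption: degrade gracefully without ext-intl
    }
    // Any-Latin converts other scripts to Latin; Latin-ASCII then maps
    // Latin letters to their closest ASCII equivalents.
    $out = transliterator_transliterate('Any-Latin; Latin-ASCII', $s);
    return $out === false ? $s : $out;
}

echo to_ascii_icu('olivæ tiësto'), "\n";
```

Compared with a hand-written table, the mapping comes from ICU, so it can change slightly between ICU releases.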

Community
ItalyPaleAle
  • Thanks, but when I try your code, "olivæ" is still "olivæ", not "olivae" – Terry Lin Jan 11 '17 at 15:05
  • I use transliterator_transliterate('Any-Latin; Latin-ASCII', "A æ Übérmensch på høyeste nivå! И я люблю PHP! fi") to solve my problem – Terry Lin Jan 11 '17 at 15:18
  • Yes `\Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;', \Transliterator::FORWARD)` will do the job – Rey0bs Nov 13 '17 at 16:48
  • Definitively agree with going to standards instead of reinventing the wheel. ICU seems the best reference. Instead, the documentation at `https://www.php.net/manual/en/transliterator.createfromrules.php` does not talk about the "rules". Where can we find a full description of what's accepted by `createFromRules()`? – Xavi Montero Apr 24 '19 at 12:01
  • @XaviMontero check out the documentation for ICU: http://userguide.icu-project.org/transforms/general/rules – ItalyPaleAle Nov 07 '19 at 14:55
  • The solution of Terry Lin seems to work well, many thanks! `transliterator_transliterate('Any-Latin; Latin-ASCII', $string)` – CheddarLizzard Jun 05 '20 at 15:38

An updated answer based on @BurninLeo's answer

function replace_spec_char($subject) {
    $char_map = array(
        "ъ" => "-", "ь" => "-", "Ъ" => "-", "Ь" => "-",
        "А" => "A", "Ă" => "A", "Ǎ" => "A", "Ą" => "A", "À" => "A", "Ã" => "A", "Á" => "A", "Æ" => "A", "Â" => "A", "Å" => "A", "Ǻ" => "A", "Ā" => "A", "א" => "A",
        "Б" => "B", "ב" => "B", "Þ" => "B",
        "Ĉ" => "C", "Ć" => "C", "Ç" => "C", "Ц" => "C", "צ" => "C", "Ċ" => "C", "Č" => "C", "©" => "C", "ץ" => "C",
        "Д" => "D", "Ď" => "D", "Đ" => "D", "ד" => "D", "Ð" => "D",
        "È" => "E", "Ę" => "E", "É" => "E", "Ë" => "E", "Ê" => "E", "Е" => "E", "Ē" => "E", "Ė" => "E", "Ě" => "E", "Ĕ" => "E", "Є" => "E", "Ə" => "E", "ע" => "E",
        "Ф" => "F", "Ƒ" => "F",
        "Ğ" => "G", "Ġ" => "G", "Ģ" => "G", "Ĝ" => "G", "Г" => "G", "ג" => "G", "Ґ" => "G",
        "ח" => "H", "Ħ" => "H", "Х" => "H", "Ĥ" => "H", "ה" => "H",
        "I" => "I", "Ï" => "I", "Î" => "I", "Í" => "I", "Ì" => "I", "Į" => "I", "Ĭ" => "I", "I" => "I", "И" => "I", "Ĩ" => "I", "Ǐ" => "I", "י" => "I", "Ї" => "I", "Ī" => "I", "І" => "I",
        "Й" => "J", "Ĵ" => "J",
        "ĸ" => "K", "כ" => "K", "Ķ" => "K", "К" => "K", "ך" => "K",
        "Ł" => "L", "Ŀ" => "L", "Л" => "L", "Ļ" => "L", "Ĺ" => "L", "Ľ" => "L", "ל" => "L",
        "מ" => "M", "М" => "M", "ם" => "M",
        "Ñ" => "N", "Ń" => "N", "Н" => "N", "Ņ" => "N", "ן" => "N", "Ŋ" => "N", "נ" => "N", "ʼn" => "N", "Ň" => "N",
        "Ø" => "O", "Ó" => "O", "Ò" => "O", "Ô" => "O", "Õ" => "O", "О" => "O", "Ő" => "O", "Ŏ" => "O", "Ō" => "O", "Ǿ" => "O", "Ǒ" => "O", "Ơ" => "O",
        "פ" => "P", "ף" => "P", "П" => "P",
        "ק" => "Q",
        "Ŕ" => "R", "Ř" => "R", "Ŗ" => "R", "ר" => "R", "Р" => "R", "®" => "R",
        "Ş" => "S", "Ś" => "S", "Ș" => "S", "Š" => "S", "С" => "S", "Ŝ" => "S", "ס" => "S",
        "Т" => "T", "Ț" => "T", "ט" => "T", "Ŧ" => "T", "ת" => "T", "Ť" => "T", "Ţ" => "T",
        "Ù" => "U", "Û" => "U", "Ú" => "U", "Ū" => "U", "У" => "U", "Ũ" => "U", "Ư" => "U", "Ǔ" => "U", "Ų" => "U", "Ŭ" => "U", "Ů" => "U", "Ű" => "U", "Ǖ" => "U", "Ǜ" => "U", "Ǚ" => "U", "Ǘ" => "U",
        "В" => "V", "ו" => "V",
        "Ý" => "Y", "Ы" => "Y", "Ŷ" => "Y", "Ÿ" => "Y",
        "Ź" => "Z", "Ž" => "Z", "Ż" => "Z", "З" => "Z", "ז" => "Z",
        "а" => "a", "ă" => "a", "ǎ" => "a", "ą" => "a", "à" => "a", "ã" => "a", "á" => "a", "æ" => "a", "â" => "a", "å" => "a", "ǻ" => "a", "ā" => "a", "א" => "a",
        "б" => "b", "ב" => "b", "þ" => "b",
        "ĉ" => "c", "ć" => "c", "ç" => "c", "ц" => "c", "צ" => "c", "ċ" => "c", "č" => "c", "©" => "c", "ץ" => "c",
        "Ч" => "ch", "ч" => "ch",
        "д" => "d", "ď" => "d", "đ" => "d", "ד" => "d", "ð" => "d",
        "è" => "e", "ę" => "e", "é" => "e", "ë" => "e", "ê" => "e", "е" => "e", "ē" => "e", "ė" => "e", "ě" => "e", "ĕ" => "e", "є" => "e", "ə" => "e", "ע" => "e",
        "ф" => "f", "ƒ" => "f",
        "ğ" => "g", "ġ" => "g", "ģ" => "g", "ĝ" => "g", "г" => "g", "ג" => "g", "ґ" => "g",
        "ח" => "h", "ħ" => "h", "х" => "h", "ĥ" => "h", "ה" => "h",
        "i" => "i", "ï" => "i", "î" => "i", "í" => "i", "ì" => "i", "į" => "i", "ĭ" => "i", "ı" => "i", "и" => "i", "ĩ" => "i", "ǐ" => "i", "י" => "i", "ї" => "i", "ī" => "i", "і" => "i",
        "й" => "j", "Й" => "j", "Ĵ" => "j", "ĵ" => "j",
        "ĸ" => "k", "כ" => "k", "ķ" => "k", "к" => "k", "ך" => "k",
        "ł" => "l", "ŀ" => "l", "л" => "l", "ļ" => "l", "ĺ" => "l", "ľ" => "l", "ל" => "l",
        "מ" => "m", "м" => "m", "ם" => "m",
        "ñ" => "n", "ń" => "n", "н" => "n", "ņ" => "n", "ן" => "n", "ŋ" => "n", "נ" => "n", "ʼn" => "n", "ň" => "n",
        "ø" => "o", "ó" => "o", "ò" => "o", "ô" => "o", "õ" => "o", "о" => "o", "ő" => "o", "ŏ" => "o", "ō" => "o", "ǿ" => "o", "ǒ" => "o", "ơ" => "o",
        "פ" => "p", "ף" => "p", "п" => "p",
        "ק" => "q",
        "ŕ" => "r", "ř" => "r", "ŗ" => "r", "ר" => "r", "р" => "r", "®" => "r",
        "ş" => "s", "ś" => "s", "ș" => "s", "š" => "s", "с" => "s", "ŝ" => "s", "ס" => "s",
        "т" => "t", "ț" => "t", "ט" => "t", "ŧ" => "t", "ת" => "t", "ť" => "t", "ţ" => "t",
        "ù" => "u", "û" => "u", "ú" => "u", "ū" => "u", "у" => "u", "ũ" => "u", "ư" => "u", "ǔ" => "u", "ų" => "u", "ŭ" => "u", "ů" => "u", "ű" => "u", "ǖ" => "u", "ǜ" => "u", "ǚ" => "u", "ǘ" => "u",
        "в" => "v", "ו" => "v",
        "ý" => "y", "ы" => "y", "ŷ" => "y", "ÿ" => "y",
        "ź" => "z", "ž" => "z", "ż" => "z", "з" => "z", "ז" => "z", "ſ" => "z",
        "™" => "tm",
        "@" => "at",
        "Ä" => "ae", "Ǽ" => "ae", "ä" => "ae", "æ" => "ae", "ǽ" => "ae",
        "ij" => "ij", "IJ" => "ij",
        "я" => "ja", "Я" => "ja",
        "Э" => "je", "э" => "je",
        "ё" => "jo", "Ё" => "jo",
        "ю" => "ju", "Ю" => "ju",
        "œ" => "oe", "Œ" => "oe", "ö" => "oe", "Ö" => "oe",
        "щ" => "sch", "Щ" => "sch",
        "ш" => "sh", "Ш" => "sh",
        "ß" => "ss",
        "Ü" => "ue",
        "Ж" => "zh", "ж" => "zh",
    );
    return strtr($subject, $char_map);
}

$string = "Ħí ŧħə®ë, юßť å test!";
echo replace_spec_char($string);

Ħí ŧħə®ë, юßť å test! => Hi there, jusst a test!

This does not mix up uppercase and lowercase characters, except for the multi-letter replacements (e.g. ss, ch, sch). It also adds @, ® and ©.

Also, if you want to build a regex that matches regardless of special characters:

rss => '[rŕřŘŗŖרŔРр](?:[sșсŜšśסşСŝ][sșсŜšśסşСŝ]|[ß])'

A Vala implementation of this: https://code.launchpad.net/~jeremy-munsch/synapse-project/ascii-smart/+merge/277477

Here is the base list you could work with. With regex replacing (in Sublime Text) or a small script, you can build anything from this array to fit your needs.

"-" => "ъьЪЬ",
"A" => "АĂǍĄÀÃÁÆÂÅǺĀא",
"B" => "БבÞ",
"C" => "ĈĆÇЦצĊČ©ץ",
"D" => "ДĎĐדÐ",
"E" => "ÈĘÉËÊЕĒĖĚĔЄƏע",
"F" => "ФƑ",
"G" => "ĞĠĢĜГגҐ",
"H" => "חĦХĤה",
"I" => "IÏÎÍÌĮĬIИĨǏיЇĪІ",
"J" => "ЙĴ",
"K" => "ĸכĶКך",
"L" => "ŁĿЛĻĹĽל",
"M" => "מМם",
"N" => "ÑŃНŅןŊנʼnŇ",
"O" => "ØÓÒÔÕОŐŎŌǾǑƠ",
"P" => "פףП",
"Q" => "ק",
"R" => "ŔŘŖרР®",
"S" => "ŞŚȘŠСŜס",
"T" => "ТȚטŦתŤŢ",
"U" => "ÙÛÚŪУŨƯǓŲŬŮŰǕǛǙǗ",
"V" => "Вו",
"Y" => "ÝЫŶŸ",
"Z" => "ŹŽŻЗז",
"a" => "аăǎąàãáæâåǻāא",
"b" => "бבþ",
"c" => "ĉćçцצċč©ץ",
"ch" => "ч",
"d" => "дďđדð",
"e" => "èęéëêеēėěĕєəע",
"f" => "фƒ",
"g" => "ğġģĝгגґ",
"h" => "חħхĥה",
"i" => "iïîíìįĭıиĩǐיїīі",
"j" => "йĵ",
"k" => "ĸכķкך",
"l" => "łŀлļĺľל",
"m" => "מмם",
"n" => "ñńнņןŋנʼnň",
"o" => "øóòôõоőŏōǿǒơ",
"p" => "פףп",
"q" => "ק",
"r" => "ŕřŗרр®",
"s" => "şśșšсŝס",
"t" => "тțטŧתťţ",
"u" => "ùûúūуũưǔųŭůűǖǜǚǘ",
"v" => "вו",
"y" => "ýыŷÿ",
"z" => "źžżзזſ",
"tm" => "™",
"at" => "@",
"ae" => "ÄǼäæǽ",
"ch" => "Чч",
"ij" => "ijIJ",
"j" => "йЙĴĵ",
"ja" => "яЯ",
"je" => "Ээ",
"jo" => "ёЁ",
"ju" => "юЮ",
"oe" => "œŒöÖ",
"sch" => "щЩ",
"sh" => "шШ",
"ss" => "ß",
"tm" => "™",
"ue" => "Ü",
"zh" => "Жж"
Kwaadpepper

I found this way to be a good one, without having to worry too much about charsets and arrays, or iconv:

function replace_accents($str) {
   $str = htmlentities($str, ENT_COMPAT, "UTF-8");
   $str = preg_replace('/&([a-zA-Z])(uml|acute|grave|circ|tilde|ring);/','$1',$str);
   return html_entity_decode($str);
}
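
A slightly extended sketch of the same idea follows. The function name replace_accents_entities is hypothetical. Compared with the function above, it also strips the cedil, slash and lig entities (so ç, ø and æ are covered) and allows two base letters so that &aelig; becomes "ae". Note that &szlig; comes out as "sz" with this pattern, not "ss".

```php
<?php
// Hypothetical variant of the entity-based approach; assumes UTF-8 input.
function replace_accents_entities(string $str): string
{
    // 1. Accented letters become named entities: é -> &eacute;, æ -> &aelig;
    $str = htmlentities($str, ENT_QUOTES, 'UTF-8');
    // 2. Keep only the base letter(s) of accent-style entities.
    $str = preg_replace(
        '/&([a-zA-Z]{1,2})(?:uml|acute|grave|circ|tilde|ring|cedil|slash|lig);/',
        '$1',
        $str
    );
    // 3. Turn the remaining entities (&amp;, &lt;, ...) back into characters.
    return html_entity_decode($str, ENT_QUOTES, 'UTF-8');
}

echo replace_accents_entities('Zacarías Ferreíra'), "\n"; // Zacarias Ferreira
```

Passing the same flags and charset to both htmlentities() and html_entity_decode() keeps the round trip symmetric for quotes and ampersands.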
Lexxx
Jazzpaths
  • Awesome solution. Works like a charm. However, you should add "slash" too, to take care of the Norwegian oslash HTML entity as well: `$str = preg_replace('/&([a-zA-Z])(uml|acute|grave|circ|tilde|ring|slash);/','$1',$str);` – Ivan Dec 20 '20 at 14:14

I found this on the php.net page for the preg_replace function:

// replace accented chars

$string = "Zacarías Ferreíra"; // my definition for string variable
$accents = '/&([A-Za-z]{1,2})(grave|acute|circ|cedil|uml|lig);/';

$string_encoded = htmlentities($string,ENT_NOQUOTES,'UTF-8');

$string = preg_replace($accents,'$1',$string_encoded);

If you have encoding issues, you may get something like this: "Zacarías Ferreíra". Just decode the string first and then use the code above:

$string = utf8_decode("Zacarías Ferreíra");
ItalyPaleAle
Kasey Thomas

This worked for me:

<?php
setlocale(LC_ALL, "en_US.utf8"); 
$val = iconv('UTF-8','ASCII//TRANSLIT',$val);
?>
ItalyPaleAle
Stergios Zg.

If you have the intl extension (http://php.net/manual/en/book.intl.php) available, this will solve your problem:

$string = "Éric Cantona";
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

EDIT

To install the PHP extension on Ubuntu:

apt-get install php-intl

Don't forget to require the ext-intl extension in Composer, to ensure it is present on deployed systems.

Xavi Montero
gabo
  • If you also want to replace other characters like 'æ', you can use `\Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;', \Transliterator::FORWARD)` instead – Rey0bs Nov 13 '17 at 16:51
protected $_convertTable = array(
    '&amp;' => 'and',   '@' => 'at',    '©' => 'c', '®' => 'r', 'À' => 'a',
    'Á' => 'a', 'Â' => 'a', 'Ä' => 'a', 'Å' => 'a', 'Æ' => 'ae','Ç' => 'c',
    'È' => 'e', 'É' => 'e', 'Ë' => 'e', 'Ì' => 'i', 'Í' => 'i', 'Î' => 'i',
    'Ï' => 'i', 'Ò' => 'o', 'Ó' => 'o', 'Ô' => 'o', 'Õ' => 'o', 'Ö' => 'o',
    'Ø' => 'o', 'Ù' => 'u', 'Ú' => 'u', 'Û' => 'u', 'Ü' => 'u', 'Ý' => 'y',
    'ß' => 'ss','à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'å' => 'a',
    'æ' => 'ae','ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
    'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ò' => 'o', 'ó' => 'o',
    'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'ù' => 'u', 'ú' => 'u',
    'û' => 'u', 'ü' => 'u', 'ý' => 'y', 'þ' => 'p', 'ÿ' => 'y', 'Ā' => 'a',
    'ā' => 'a', 'Ă' => 'a', 'ă' => 'a', 'Ą' => 'a', 'ą' => 'a', 'Ć' => 'c',
    'ć' => 'c', 'Ĉ' => 'c', 'ĉ' => 'c', 'Ċ' => 'c', 'ċ' => 'c', 'Č' => 'c',
    'č' => 'c', 'Ď' => 'd', 'ď' => 'd', 'Đ' => 'd', 'đ' => 'd', 'Ē' => 'e',
    'ē' => 'e', 'Ĕ' => 'e', 'ĕ' => 'e', 'Ė' => 'e', 'ė' => 'e', 'Ę' => 'e',
    'ę' => 'e', 'Ě' => 'e', 'ě' => 'e', 'Ĝ' => 'g', 'ĝ' => 'g', 'Ğ' => 'g',
    'ğ' => 'g', 'Ġ' => 'g', 'ġ' => 'g', 'Ģ' => 'g', 'ģ' => 'g', 'Ĥ' => 'h',
    'ĥ' => 'h', 'Ħ' => 'h', 'ħ' => 'h', 'Ĩ' => 'i', 'ĩ' => 'i', 'Ī' => 'i',
    'ī' => 'i', 'Ĭ' => 'i', 'ĭ' => 'i', 'Į' => 'i', 'į' => 'i', 'İ' => 'i',
    'ı' => 'i', 'IJ' => 'ij','ij' => 'ij','Ĵ' => 'j', 'ĵ' => 'j', 'Ķ' => 'k',
    'ķ' => 'k', 'ĸ' => 'k', 'Ĺ' => 'l', 'ĺ' => 'l', 'Ļ' => 'l', 'ļ' => 'l',
    'Ľ' => 'l', 'ľ' => 'l', 'Ŀ' => 'l', 'ŀ' => 'l', 'Ł' => 'l', 'ł' => 'l',
    'Ń' => 'n', 'ń' => 'n', 'Ņ' => 'n', 'ņ' => 'n', 'Ň' => 'n', 'ň' => 'n',
    'ʼn' => 'n', 'Ŋ' => 'n', 'ŋ' => 'n', 'Ō' => 'o', 'ō' => 'o', 'Ŏ' => 'o',
    'ŏ' => 'o', 'Ő' => 'o', 'ő' => 'o', 'Œ' => 'oe','œ' => 'oe','Ŕ' => 'r',
    'ŕ' => 'r', 'Ŗ' => 'r', 'ŗ' => 'r', 'Ř' => 'r', 'ř' => 'r', 'Ś' => 's',
    'ś' => 's', 'Ŝ' => 's', 'ŝ' => 's', 'Ş' => 's', 'ş' => 's', 'Š' => 's',
    'š' => 's', 'Ţ' => 't', 'ţ' => 't', 'Ť' => 't', 'ť' => 't', 'Ŧ' => 't',
    'ŧ' => 't', 'Ũ' => 'u', 'ũ' => 'u', 'Ū' => 'u', 'ū' => 'u', 'Ŭ' => 'u',
    'ŭ' => 'u', 'Ů' => 'u', 'ů' => 'u', 'Ű' => 'u', 'ű' => 'u', 'Ų' => 'u',
    'ų' => 'u', 'Ŵ' => 'w', 'ŵ' => 'w', 'Ŷ' => 'y', 'ŷ' => 'y', 'Ÿ' => 'y',
    'Ź' => 'z', 'ź' => 'z', 'Ż' => 'z', 'ż' => 'z', 'Ž' => 'z', 'ž' => 'z',
    'ſ' => 'z', 'Ə' => 'e', 'ƒ' => 'f', 'Ơ' => 'o', 'ơ' => 'o', 'Ư' => 'u',
    'ư' => 'u', 'Ǎ' => 'a', 'ǎ' => 'a', 'Ǐ' => 'i', 'ǐ' => 'i', 'Ǒ' => 'o',
    'ǒ' => 'o', 'Ǔ' => 'u', 'ǔ' => 'u', 'Ǖ' => 'u', 'ǖ' => 'u', 'Ǘ' => 'u',
    'ǘ' => 'u', 'Ǚ' => 'u', 'ǚ' => 'u', 'Ǜ' => 'u', 'ǜ' => 'u', 'Ǻ' => 'a',
    'ǻ' => 'a', 'Ǽ' => 'ae','ǽ' => 'ae','Ǿ' => 'o', 'ǿ' => 'o', 'ə' => 'e',
    'Ё' => 'jo','Є' => 'e', 'І' => 'i', 'Ї' => 'i', 'А' => 'a', 'Б' => 'b',
    'В' => 'v', 'Г' => 'g', 'Д' => 'd', 'Е' => 'e', 'Ж' => 'zh','З' => 'z',
    'И' => 'i', 'Й' => 'j', 'К' => 'k', 'Л' => 'l', 'М' => 'm', 'Н' => 'n',
    'О' => 'o', 'П' => 'p', 'Р' => 'r', 'С' => 's', 'Т' => 't', 'У' => 'u',
    'Ф' => 'f', 'Х' => 'h', 'Ц' => 'c', 'Ч' => 'ch','Ш' => 'sh','Щ' => 'sch',
    'Ъ' => '-', 'Ы' => 'y', 'Ь' => '-', 'Э' => 'je','Ю' => 'ju','Я' => 'ja',
    'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e',
    'ж' => 'zh','з' => 'z', 'и' => 'i', 'й' => 'j', 'к' => 'k', 'л' => 'l',
    'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r', 'с' => 's',
    'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'ch',
    'ш' => 'sh','щ' => 'sch','ъ' => '-','ы' => 'y', 'ь' => '-', 'э' => 'je',
    'ю' => 'ju','я' => 'ja','ё' => 'jo','є' => 'e', 'і' => 'i', 'ї' => 'i',
    'Ґ' => 'g', 'ґ' => 'g', 'א' => 'a', 'ב' => 'b', 'ג' => 'g', 'ד' => 'd',
    'ה' => 'h', 'ו' => 'v', 'ז' => 'z', 'ח' => 'h', 'ט' => 't', 'י' => 'i',
    'ך' => 'k', 'כ' => 'k', 'ל' => 'l', 'ם' => 'm', 'מ' => 'm', 'ן' => 'n',
    'נ' => 'n', 'ס' => 's', 'ע' => 'e', 'ף' => 'p', 'פ' => 'p', 'ץ' => 'C',
    'צ' => 'c', 'ק' => 'q', 'ר' => 'r', 'ש' => 'w', 'ת' => 't', '™' => 'tm',
);

From Magento; I'm using it for basically everything.

user3682119
  • Pretty nice. Who's magento? – BurninLeo Sep 19 '14 at 08:30
  • This should be a built-in function in all web languages, for translating invalid URL characters while maintaining readable and SEO-friendly URLs, since the alternative is currently to URL-encode, which makes the URL ugly, long, and unreadable. Of course it can't be made to efficiently support many Asian languages, but this covers most others. Worth noting that this ugly-looking solution is much better than using iconv with //TRANSLIT, which will leave you with many question marks and also requires knowing the input encoding to convert. – ekerner Jan 24 '15 at 17:17
  • When compared to the above postings, these characters may be added: `'Ã' => 'A', 'ã' => 'a', 'Þ' => 'B', 'Ê' => 'E', 'Ñ' => 'N', 'ð' => 'o', 'ñ' => 'n', 'ș' => 's', 'Ș' => 'S', 'ț' => 't', 'Ț' => 'T'` – BurninLeo Jan 29 '15 at 11:36
  • FYI @BurninLeo The letter 'ð' should not be substituted with 'o', as it is the Icelandic letter for something closer to 'd' – daker Feb 10 '15 at 09:15

I've searched, and your idea for accent stripping is quite good and cost-effective, but your regex is wrong and is missing two modifiers. Long story short, the patterns must be:

$patterns[0] = '/[áâàåä]/ui';
$patterns[1] = '/[ðéêèë]/ui';
$patterns[2] = '/[íîìï]/ui';
$patterns[3] = '/[óôòøõö]/ui';
$patterns[4] = '/[úûùü]/ui';
$patterns[5] = '/æ/ui';
$patterns[6] = '/ç/ui';
$patterns[7] = '/ß/ui';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';

As you can see, it is quite similar, but the most important thing is the modifiers after the closing slash of the regular expression. In a pattern like /[someCoolRegex]/ui, the u specifies that it must use Unicode (UTF-8) and the i specifies that it is case-insensitive. I've tested this against the other answers here, and I must say it is more cost-effective than using strtr.

Hope someone reads this answer.
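
Put together, the patterns above can be sketched as a single function. The name strip_accents_regex is hypothetical, and the code assumes the source file and the input are both UTF-8. Unlike the patterns above, ð is left out of the "e" group, since other answers here point out that it transliterates to d.

```php
<?php
// Hypothetical helper built from the /u patterns; assumes UTF-8 throughout.
function strip_accents_regex(string $string): string
{
    // Pattern => replacement. "u" makes PCRE treat pattern and subject as
    // UTF-8; "i" also matches the uppercase forms (É as well as é).
    $map = [
        '/[áâàåä]/ui'  => 'a',
        '/[éêèë]/ui'   => 'e',
        '/[íîìï]/ui'   => 'i',
        '/[óôòøõö]/ui' => 'o',
        '/[úûùü]/ui'   => 'u',
        '/æ/ui'        => 'ae',
        '/ç/ui'        => 'c',
        '/ß/u'         => 'ss',
    ];
    $string = preg_replace(array_keys($map), array_values($map), $string);
    // Lowercase last: strtolower() is only reliable for ASCII letters, so the
    // accented characters are replaced before it runs.
    return strtolower($string);
}

echo strip_accents_regex('Éric Cantona'), "\n"; // eric cantona
```

Running the replacement before strtolower() is a deliberate choice here: it avoids depending on how strtolower() treats multi-byte characters.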

Colorman

Disclaimer: I'm not supporting this answer anymore (I was blind at that time). But thanks for the up-votes =P

You can take this as basis. From WordPress, used to generate pretty urls (the entry point is the slugify() function):

/**
 * Converts all accent characters to ASCII characters.
 *
 * If there are no accent characters, then the string given is just returned.
 *
 * @param string $string Text that might have accent characters
 * @return string Filtered string with replaced "nice" characters.
 */

function remove_accents($string) {
 if (!preg_match('/[\x80-\xff]/', $string))
  return $string;
 if (seems_utf8($string)) {
  $chars = array(
  // Decompositions for Latin-1 Supplement
  chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
  chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
  chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
  chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
  chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
  chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
  chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
  chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
  chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
  chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
  chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
  chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
  chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
  chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
  chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
  chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
  chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
  chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
  chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
  chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
  chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
  chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
  chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
  chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
  chr(195).chr(185) => 'u',
  chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
  chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
  chr(195).chr(191) => 'y',
  // Decompositions for Latin Extended-A
  chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
  chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
  chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
  chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
  chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
  chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
  chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
  chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
  chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
  chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
  chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
  chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
  chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
  chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
  chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
  chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
  chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
  chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
  chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
  chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
  chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
  chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
  chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
  chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
  chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
  chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
  chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
  chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
  chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
  chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
  chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
  chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
  chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
  chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
  chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
  chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
  chr(197).chr(136) => 'n', chr(197).chr(137) => 'n',
  chr(197).chr(138) => 'N', chr(197).chr(139) => 'n',
  chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
  chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
  chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
  chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
  chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
  chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
  chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
  chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
  chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
  chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
  chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
  chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
  chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
  chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
  chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
  chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
  chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
  chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
  chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
  chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
  chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
  chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
  chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
  chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
  chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
  chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
  // Euro Sign
  chr(226).chr(130).chr(172) => 'E',
  // GBP (Pound) Sign
  chr(194).chr(163) => '');
  $string = strtr($string, $chars);
 } else {
  // Assume ISO-8859-1 if not UTF-8
  $chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
   .chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
   .chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
   .chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
   .chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
   .chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
   .chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
   .chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
   .chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
   .chr(252).chr(253).chr(255);
  $chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
  $string = strtr($string, $chars['in'], $chars['out']);
  $double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
  $double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
  $string = str_replace($double_chars['in'], $double_chars['out'], $string);
 }
 return $string;
}

/**
 * Checks to see if a string is utf8 encoded.
 *
 * @author bmorel at ssi dot fr
 *
 * @param string $Str The string to be checked
 * @return bool True if $Str fits a UTF-8 model, false otherwise.
 */
function seems_utf8($Str) { # by bmorel at ssi dot fr
 $length = strlen($Str);
 for ($i = 0; $i < $length; $i++) {
  if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
  elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n = 1; # 110bbbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n = 2; # 1110bbbb
  elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n = 3; # 11110bbb
  elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n = 4; # 111110bb
  elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n = 5; # 1111110b
  else return false; # Does not match any model
  for ($j = 0; $j < $n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == $length) || ((ord($Str[$i]) & 0xC0) != 0x80))
   return false;
  }
 }
 return true;
}

function utf8_uri_encode($utf8_string, $length = 0) {
 $unicode = '';
 $values = array();
 $num_octets = 1;
 $unicode_length = 0;
 $string_length = strlen($utf8_string);
 for ($i = 0; $i < $string_length; $i++) {
  $value = ord($utf8_string[$i]);
  if ($value < 128) {
   if ($length && ($unicode_length >= $length))
    break;
   $unicode .= chr($value);
   $unicode_length++;
  } else {
   if (count($values) == 0) $num_octets = ($value < 224) ? 2 : 3;
   $values[] = $value;
   if ($length && ($unicode_length + ($num_octets * 3)) > $length)
    break;
   if (count( $values ) == $num_octets) {
    if ($num_octets == 3) {
     $unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]) . '%' . dechex($values[2]);
     $unicode_length += 9;
    } else {
     $unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]);
     $unicode_length += 6;
    }
    $values = array();
    $num_octets = 1;
   }
  }
 }
 return $unicode;
}

/**
 * Sanitizes title, replacing whitespace with dashes.
 *
 * Limits the output to alphanumeric characters, underscore (_) and dash (-).
 * Whitespace becomes a dash.
 *
 * @param string $title The title to be sanitized.
 * @return string The sanitized title.
 */
function slugify($title) {
 $title = strip_tags($title);
 // Preserve escaped octets.
 $title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
 // Remove percent signs that are not part of an octet.
 $title = str_replace('%', '', $title);
 // Restore octets.
 $title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
 $title = remove_accents($title);
 if (seems_utf8($title)) {
  if (function_exists('mb_strtolower')) {
   $title = mb_strtolower($title, 'UTF-8');
  }
  $title = utf8_uri_encode($title, 200);
 }
 $title = strtolower($title);
 $title = preg_replace('/&.+?;/', '', $title); // kill entities
 $title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
 $title = preg_replace('/\s+/', '-', $title);
 $title = preg_replace('|-+|', '-', $title);
 $title = trim($title, '-');
 return $title;
}
Keyne Viana
  • Thanks for this. I wanted to do this on a Wordpress site and didn't realize Wordpress had a built-in function for it :) – Matt Browne Feb 13 '17 at 18:55
3

You can use PHP's strtr() function to get rid of accented characters:

$string = "Éric Cantona";
$accented_array = array('Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E','Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U','Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c','è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o','ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );

$required_str = strtr( $string, $accented_array );
Rahul Gupta
2

strtolower works byte by byte, so it cannot lowercase multibyte UTF-8 characters such as É. You could try mb_strtolower instead.

Or, if you have to deal with multibyte extensions anyway, you might as well use iconv's transliteration support:

iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text);

Edit:

It seems I was a bit fast. You appear to use ISO-8859-1, so your current strategy will work. You just need to write the regexps properly. E.g.:

'/(ð|é|ê|è|ë)/'

not:

'/[ð|é|ê|è|ë]/'
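To illustrate the difference between the two forms, here is a small sketch that uses plain ASCII so the result does not depend on the encoding of the script file. Inside square brackets the pipe is just another character in the class; inside parentheses it separates alternatives. The variable names are only for this example.

```php
<?php
$subject = 'a|b';

// Character class: matches "a", "|" or "b", so the pipe is replaced too.
$classResult = preg_replace('/[a|b]/', 'x', $subject);

// Group with alternation: matches "a" or "b", the pipe is left alone.
$groupResult = preg_replace('/(a|b)/', 'x', $subject);

echo $classResult, "\n"; // xxx
echo $groupResult, "\n"; // x|x
```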
troelskn
  • I would never take the regexp route unless there is no choice; use iconv to ASCII//TRANSLIT – mvds Jul 30 '10 at 13:18
  • @NullUserException I've heard about that, but my provider won't even upgrade to PHP 5.3 as that would 'break too many old scripts'. On an unrelated note, my favourite Perl has had UTF-8 support for years :P (though I never used it for CGI). – MvanGeest Jul 30 '10 at 13:23
  • @NullUserException: The old PHP6 plans were scrapped. – Daniel Egeberg Jul 30 '10 at 13:36
  • 1
    @MvanGeest Note that you can use utf-8 with PHP as of today. You just need to be aware of a few pitfalls (Eg. most string-functions expect the input to be latin1). But it's certainly doable, and I would generally recommend that for any new applications. – troelskn Jul 30 '10 at 15:16
1

I know, that question has been asked a long long time ago...

I was looking for a short and elegant solution, but couldn't find a satisfying one, for two reasons:

First, most of the existing solutions replace a list of characters with a list of other characters. Unfortunately, that requires the PHP script file itself to be saved in a specific encoding, which might be unwanted.

Second, using iconv seems to be a good way, but it's not enough on its own, as converting a single character can yield one character, two characters, or a failure.

So I wrote this small function, which does the job:

function replaceAccent($string, $replacement = '_')
{
    $alnumPattern = '/^[a-zA-Z0-9 ]+$/';

    if (preg_match($alnumPattern, $string)) {
        return $string;
    }

    $ret = array_map(
        function ($chr) use ($alnumPattern, $replacement) {
            if (preg_match($alnumPattern, $chr)) {
                return $chr;
            } else {
                $chr = @iconv('ISO-8859-1', 'ASCII//TRANSLIT', $chr);
                if (strlen($chr) == 1) {
                    return $chr;
                } elseif (strlen($chr) > 1) {
                    $ret = '';
                    foreach (str_split($chr) as $char2) {
                        if (preg_match($alnumPattern, $char2)) {
                            $ret .= $char2;
                        }
                    }
                    return $ret;
                } else {
                    // replace whatever iconv fails to convert with something else
                    return $replacement;
                }
            }
        },
        str_split($string)
    );

    return implode($ret);
}
frenus
1

As an alternative (a bit more complex in nature, though), have a look at how WordPress does accent removal. I made some changes below so that it runs independently, without referencing other WordPress functions.

function mbstring_binary_safe_encoding($reset = false)
{
    static $encodings  = array();
    static $overloaded = null;

    if (is_null($overloaded)) {
        $overloaded = function_exists('mb_internal_encoding') && (ini_get('mbstring.func_overload') & 2);
    }

    if (false === $overloaded) {
        return;
    }

    if (!$reset) {
        $encoding = mb_internal_encoding();
        array_push($encodings, $encoding);
        mb_internal_encoding('ISO-8859-1');
    }

    if ($reset && $encodings) {
        $encoding = array_pop($encodings);
        mb_internal_encoding($encoding);
    }
}

function seems_utf8($str)
{
    mbstring_binary_safe_encoding();
    $length = strlen($str);
    mbstring_binary_safe_encoding(true);
    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) {
            $n = 0; // 0bbbbbbb
        } elseif (($c & 0xE0) == 0xC0) {
            $n = 1; // 110bbbbb
        } elseif (($c & 0xF0) == 0xE0) {
            $n = 2; // 1110bbbb
        } elseif (($c & 0xF8) == 0xF0) {
            $n = 3; // 11110bbb
        } elseif (($c & 0xFC) == 0xF8) {
            $n = 4; // 111110bb
        } elseif (($c & 0xFE) == 0xFC) {
            $n = 5; // 1111110b
        } else {
            return false; // Does not match any model
        }
        for ($j = 0; $j < $n; $j++) {
            // n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                return false;
            }
        }
    }
    return true;
}

    function remove_accents($string)
{
        if (!preg_match('/[\x80-\xff]/', $string)) {
            return $string;
        }

        if (seems_utf8($string)) {
            $chars = array(
                // Decompositions for Latin-1 Supplement
                'ª' => 'a', 'º'  => 'o',
                'À' => 'A', 'Á'  => 'A',
                'Â' => 'A', 'Ã'  => 'A',
                'Ä' => 'A', 'Å'  => 'A',
                'Æ' => 'AE', 'Ç' => 'C',
                'È' => 'E', 'É'  => 'E',
                'Ê' => 'E', 'Ë'  => 'E',
                'Ì' => 'I', 'Í'  => 'I',
                'Î' => 'I', 'Ï'  => 'I',
                'Ð' => 'D', 'Ñ'  => 'N',
                'Ò' => 'O', 'Ó'  => 'O',
                'Ô' => 'O', 'Õ'  => 'O',
                'Ö' => 'O', 'Ù'  => 'U',
                'Ú' => 'U', 'Û'  => 'U',
                'Ü' => 'U', 'Ý'  => 'Y',
                'Þ' => 'TH', 'ß' => 's',
                'à' => 'a', 'á'  => 'a',
                'â' => 'a', 'ã'  => 'a',
                'ä' => 'a', 'å'  => 'a',
                'æ' => 'ae', 'ç' => 'c',
                'è' => 'e', 'é'  => 'e',
                'ê' => 'e', 'ë'  => 'e',
                'ì' => 'i', 'í'  => 'i',
                'î' => 'i', 'ï'  => 'i',
                'ð' => 'd', 'ñ'  => 'n',
                'ò' => 'o', 'ó'  => 'o',
                'ô' => 'o', 'õ'  => 'o',
                'ö' => 'o', 'ø'  => 'o',
                'ù' => 'u', 'ú'  => 'u',
                'û' => 'u', 'ü'  => 'u',
                'ý' => 'y', 'þ'  => 'th',
                'ÿ' => 'y', 'Ø'  => 'O',
                // Decompositions for Latin Extended-A
                'Ā' => 'A', 'ā'  => 'a',
                'Ă' => 'A', 'ă'  => 'a',
                'Ą' => 'A', 'ą'  => 'a',
                'Ć' => 'C', 'ć'  => 'c',
                'Ĉ' => 'C', 'ĉ'  => 'c',
                'Ċ' => 'C', 'ċ'  => 'c',
                'Č' => 'C', 'č'  => 'c',
                'Ď' => 'D', 'ď'  => 'd',
                'Đ' => 'D', 'đ'  => 'd',
                'Ē' => 'E', 'ē'  => 'e',
                'Ĕ' => 'E', 'ĕ'  => 'e',
                'Ė' => 'E', 'ė'  => 'e',
                'Ę' => 'E', 'ę'  => 'e',
                'Ě' => 'E', 'ě'  => 'e',
                'Ĝ' => 'G', 'ĝ'  => 'g',
                'Ğ' => 'G', 'ğ'  => 'g',
                'Ġ' => 'G', 'ġ'  => 'g',
                'Ģ' => 'G', 'ģ'  => 'g',
                'Ĥ' => 'H', 'ĥ'  => 'h',
                'Ħ' => 'H', 'ħ'  => 'h',
                'Ĩ' => 'I', 'ĩ'  => 'i',
                'Ī' => 'I', 'ī'  => 'i',
                'Ĭ' => 'I', 'ĭ'  => 'i',
                'Į' => 'I', 'į'  => 'i',
                'İ' => 'I', 'ı'  => 'i',
                'Ĳ' => 'IJ', 'ĳ' => 'ij',
                'Ĵ' => 'J', 'ĵ'  => 'j',
                'Ķ' => 'K', 'ķ'  => 'k',
                'ĸ' => 'k', 'Ĺ'  => 'L',
                'ĺ' => 'l', 'Ļ'  => 'L',
                'ļ' => 'l', 'Ľ'  => 'L',
                'ľ' => 'l', 'Ŀ'  => 'L',
                'ŀ' => 'l', 'Ł'  => 'L',
                'ł' => 'l', 'Ń'  => 'N',
                'ń' => 'n', 'Ņ'  => 'N',
                'ņ' => 'n', 'Ň'  => 'N',
                'ň' => 'n', 'ŉ'  => 'n',
                'Ŋ' => 'N', 'ŋ'  => 'n',
                'Ō' => 'O', 'ō'  => 'o',
                'Ŏ' => 'O', 'ŏ'  => 'o',
                'Ő' => 'O', 'ő'  => 'o',
                'Œ' => 'OE', 'œ' => 'oe',
                'Ŕ' => 'R', 'ŕ'  => 'r',
                'Ŗ' => 'R', 'ŗ'  => 'r',
                'Ř' => 'R', 'ř'  => 'r',
                'Ś' => 'S', 'ś'  => 's',
                'Ŝ' => 'S', 'ŝ'  => 's',
                'Ş' => 'S', 'ş'  => 's',
                'Š' => 'S', 'š'  => 's',
                'Ţ' => 'T', 'ţ'  => 't',
                'Ť' => 'T', 'ť'  => 't',
                'Ŧ' => 'T', 'ŧ'  => 't',
                'Ũ' => 'U', 'ũ'  => 'u',
                'Ū' => 'U', 'ū'  => 'u',
                'Ŭ' => 'U', 'ŭ'  => 'u',
                'Ů' => 'U', 'ů'  => 'u',
                'Ű' => 'U', 'ű'  => 'u',
                'Ų' => 'U', 'ų'  => 'u',
                'Ŵ' => 'W', 'ŵ'  => 'w',
                'Ŷ' => 'Y', 'ŷ'  => 'y',
                'Ÿ' => 'Y', 'Ź'  => 'Z',
                'ź' => 'z', 'Ż'  => 'Z',
                'ż' => 'z', 'Ž'  => 'Z',
                'ž' => 'z', 'ſ'  => 's',
                // Decompositions for Latin Extended-B
                'Ș' => 'S', 'ș'  => 's',
                'Ț' => 'T', 'ț'  => 't',
                // Euro Sign
                '€' => 'E',
                // GBP (Pound) Sign
                '£' => '',
                // Vowels with diacritic (Vietnamese)
                // unmarked
                'Ơ' => 'O', 'ơ'  => 'o',
                'Ư' => 'U', 'ư'  => 'u',
                // grave accent
                'Ầ' => 'A', 'ầ'  => 'a',
                'Ằ' => 'A', 'ằ'  => 'a',
                'Ề' => 'E', 'ề'  => 'e',
                'Ồ' => 'O', 'ồ'  => 'o',
                'Ờ' => 'O', 'ờ'  => 'o',
                'Ừ' => 'U', 'ừ'  => 'u',
                'Ỳ' => 'Y', 'ỳ'  => 'y',
                // hook
                'Ả' => 'A', 'ả'  => 'a',
                'Ẩ' => 'A', 'ẩ'  => 'a',
                'Ẳ' => 'A', 'ẳ'  => 'a',
                'Ẻ' => 'E', 'ẻ'  => 'e',
                'Ể' => 'E', 'ể'  => 'e',
                'Ỉ' => 'I', 'ỉ'  => 'i',
                'Ỏ' => 'O', 'ỏ'  => 'o',
                'Ổ' => 'O', 'ổ'  => 'o',
                'Ở' => 'O', 'ở'  => 'o',
                'Ủ' => 'U', 'ủ'  => 'u',
                'Ử' => 'U', 'ử'  => 'u',
                'Ỷ' => 'Y', 'ỷ'  => 'y',
                // tilde
                'Ẫ' => 'A', 'ẫ'  => 'a',
                'Ẵ' => 'A', 'ẵ'  => 'a',
                'Ẽ' => 'E', 'ẽ'  => 'e',
                'Ễ' => 'E', 'ễ'  => 'e',
                'Ỗ' => 'O', 'ỗ'  => 'o',
                'Ỡ' => 'O', 'ỡ'  => 'o',
                'Ữ' => 'U', 'ữ'  => 'u',
                'Ỹ' => 'Y', 'ỹ'  => 'y',
                // acute accent
                'Ấ' => 'A', 'ấ'  => 'a',
                'Ắ' => 'A', 'ắ'  => 'a',
                'Ế' => 'E', 'ế'  => 'e',
                'Ố' => 'O', 'ố'  => 'o',
                'Ớ' => 'O', 'ớ'  => 'o',
                'Ứ' => 'U', 'ứ'  => 'u',
                // dot below
                'Ạ' => 'A', 'ạ'  => 'a',
                'Ậ' => 'A', 'ậ'  => 'a',
                'Ặ' => 'A', 'ặ'  => 'a',
                'Ẹ' => 'E', 'ẹ'  => 'e',
                'Ệ' => 'E', 'ệ'  => 'e',
                'Ị' => 'I', 'ị'  => 'i',
                'Ọ' => 'O', 'ọ'  => 'o',
                'Ộ' => 'O', 'ộ'  => 'o',
                'Ợ' => 'O', 'ợ'  => 'o',
                'Ụ' => 'U', 'ụ'  => 'u',
                'Ự' => 'U', 'ự'  => 'u',
                'Ỵ' => 'Y', 'ỵ'  => 'y',
                // Vowels with diacritic (Chinese, Hanyu Pinyin)
                'ɑ' => 'a',
                // macron
                'Ǖ' => 'U', 'ǖ'  => 'u',
                // acute accent
                'Ǘ' => 'U', 'ǘ'  => 'u',
                // caron
                'Ǎ' => 'A', 'ǎ'  => 'a',
                'Ǐ' => 'I', 'ǐ'  => 'i',
                'Ǒ' => 'O', 'ǒ'  => 'o',
                'Ǔ' => 'U', 'ǔ'  => 'u',
                'Ǚ' => 'U', 'ǚ'  => 'u',
                // grave accent
                'Ǜ' => 'U', 'ǜ'  => 'u',
            );

            $string = strtr($string, $chars);
        } else {
            $chars = array();
            // Assume ISO-8859-1 if not UTF-8
            $chars['in'] = "\x80\x83\x8a\x8e\x9a\x9e"
                . "\x9f\xa2\xa5\xb5\xc0\xc1\xc2"
                . "\xc3\xc4\xc5\xc7\xc8\xc9\xca"
                . "\xcb\xcc\xcd\xce\xcf\xd1\xd2"
                . "\xd3\xd4\xd5\xd6\xd8\xd9\xda"
                . "\xdb\xdc\xdd\xe0\xe1\xe2\xe3"
                . "\xe4\xe5\xe7\xe8\xe9\xea\xeb"
                . "\xec\xed\xee\xef\xf1\xf2\xf3"
                . "\xf4\xf5\xf6\xf8\xf9\xfa\xfb"
                . "\xfc\xfd\xff";

            $chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";

            $string              = strtr($string, $chars['in'], $chars['out']);
            $double_chars        = array();
            $double_chars['in']  = array("\x8c", "\x9c", "\xc6", "\xd0", "\xde", "\xdf", "\xe6", "\xf0", "\xfe");
            $double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
            $string              = str_replace($double_chars['in'], $double_chars['out'], $string);
        }

        return $string;
    }
Selay
1

Vietnamese characters for those who need them

$unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );
Dino
Phuong Le
0

Adding a little to what Lizard said: that worked for displaying the text correctly on a web page, but I added a few more entries (numeric HTML entities) to get what I was looking for, which was replacing my tags so that they match correctly against my database when they contain special characters. Thanks in advance.

$unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y',
                            '&#225;'=>'a', '&#233;'=>'e', '&#237;'=>'i', '&#243;'=>'o', '&#250;'=>'u',  
                            '&#193;'=>'A', '&#201;'=>'E', '&#205;'=>'I', '&#211;'=>'O', '&#218;'=>'U',
                            '&#209;'=>'N', '&#241;'=>'n' );
$newtag = strtr( $newtag, $unwanted_array );
Luis H Cabrejo
0

For anyone who wants to transliterate umlauts the German way (ü becomes ue, and so on), this method can be used:

public function handleGermanUmlauts(string $name) : string
{
    // Convert accented characters to named entities (ü becomes &uuml;) so that preg_replace can match them
    $name = htmlentities($name, ENT_COMPAT, 'UTF-8');
    // Replace the entity suffix with an `e` (&uuml; becomes ue); e/E are excluded so that no `ee` is produced
    $name = preg_replace('/&([a-df-zA-DF-Z])(uml|acute|grave|circ|tilde|ring);/', '$1e', $name);
    // Decode the remaining entities so that iconv can process them in the next line
    $name = html_entity_decode($name);
    // Let iconv transliterate all other characters (e.g. the euro sign), then strip leftover accent marks
    $name = str_replace(array("\"", "'", "`", "^", "~"), "", iconv("utf-8", "ASCII//TRANSLIT", $name));

    return $name;
}
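Note that the pattern above appends an `e` for every listed suffix, so á (acute) also becomes ae. If only real umlauts should be expanded, the entity step could be narrowed as in the following sketch. It assumes PHP 7 or later and UTF-8 input; umlauts_to_e() is a hypothetical name, and the iconv step is left out on purpose because its output depends on the platform.

```php
<?php
// Hypothetical helper: expands only umlauts and leaves other accents untouched.
function umlauts_to_e(string $name): string
{
    // "ü" becomes "&uuml;", "é" becomes "&eacute;", and so on.
    $name = htmlentities($name, ENT_COMPAT, 'UTF-8');
    // Rewrite only umlaut entities: "&uuml;" -> "ue", "&Ouml;" -> "Oe".
    $name = preg_replace('/&([aouAOU])uml;/', '$1e', $name);
    // Turn every remaining entity back into its original character.
    return html_entity_decode($name, ENT_COMPAT, 'UTF-8');
}

echo umlauts_to_e("Müller"), "\n"; // Mueller
```

Characters that are not umlauts, such as the é in José, pass through unchanged and can be handled by a later step.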
Farid shahidi
-1

Well, my favorite is a function which replaces German umlauts and uses iconv afterwards. If you want a nice SEO slug, you'll have e.g. "Ä" as "ae" etc.

function slugifyText($text) {
    $arr1 = Array('ä','ö','ü','Ä','Ö','Ü','ß');
    $arr2 = Array('ae', 'oe', 'ue', 'ae', 'oe', 'ue', 'ss');
    $text = str_replace($arr1, $arr2, $text);
    $text = iconv('UTF-8','ASCII//TRANSLIT',$text);
    $text = preg_replace("/[^a-zA-Z0-9_-]/", "", strtolower($text));
    return $text;
}
Marco