48

What is the most efficient way to remove accents from a string e.g. ÈâuÑ becomes Eaun?

Is there a simple, built-in way that I'm missing, or should I use a regular expression?

Ripon Al Wasim
Mark Lalor
  • 7
    @Peeps: telling users to search google is against Stack Overflow's etiquette. If the question doesn't exist on the website it's better for everyone if it is asked, even if the OP already knows the answer, since it will increase our number of non-duplicate questions. So maybe next time if someone searches it with google they will find this very question, and we will have one more user. – Andreas Bonini Aug 22 '10 at 18:30
  • @Andreas good point. However, this is most certainly a SO duplicate, so Peeps kind of has a small point :) I'm too lazy to search for it right now, though. – Pekka Aug 22 '10 at 18:33

5 Answers

57

I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php):

var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
    "A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "

see: http://www.php.net/normalizer

EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is that non-Latin characters are not ignored either.

EDIT2: I discovered that some characters are not covered by the transliteration I posted originally. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-Latin characters. I also added a test to the text ;)

I assume that "Latin" here means the Latin alphabet and not one of the Latin character sets. In any case, in my opinion Latin-ASCII should then transliterate such a character to something in ASCII ...

EDIT3: Sorry for another change here. I had to take the characters down to u0080 instead of u0100, to get only ASCII characters as output. The test above is updated.
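
As a minimal sketch of the approach above, assuming the intl extension is loaded: the wrapper name `toAsciiIcu` and its `$stripNonAscii` flag are hypothetical. It passes only `Any-Latin; Latin-ASCII` to the transliterator, so that dropping any leftover non-ASCII characters becomes an explicit, optional step.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): transliterate UTF-8 text to Latin
 * script, then fold Latin letters to ASCII using ICU transliterator IDs.
 */
function toAsciiIcu(string $text, bool $stripNonAscii = true): string
{
    if (!function_exists('transliterator_transliterate')) {
        throw new RuntimeException('The intl extension is required.');
    }

    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
    if ($result === false) {
        throw new RuntimeException('Transliteration failed.');
    }

    if (!$stripNonAscii) {
        return $result;
    }

    // Drop whatever is still outside the ASCII range after transliteration.
    return (string) preg_replace('/[^\x00-\x7F]/u', '', $result);
}

echo toAsciiIcu('ÈâuÑ'), "\n";
```

Calling it with `false` as the second argument keeps symbols that have no ASCII form instead of silently deleting them.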

SimonSimCity
  • 4
    Note: it needs `php_intl.dll` extension enabled – Oriol Aug 29 '13 at 18:42
  • I agree, this was the best function for me too! (and I tried many) – lokers Jan 07 '14 at 00:03
  • Really good solution, very easy to use and more useful than other solutions using str_replace. – Baptiste Donaux Aug 18 '14 at 12:22
  • 4
    It should be noted that this will not just transliterate the text (as the OP asked), but will remove some characters too, e.g. € (euro sign) will be removed. Just pass 'Any-Latin; Latin-ASCII;' as the first param to keep those. Optionally, you can then use iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str) to transform "€" to "EUR". – Skacc Feb 04 '15 at 15:33
57

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);

(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
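
A hedged sketch of how this call might be wrapped, since the output of `//TRANSLIT` depends on the iconv implementation and on the active locale. The function name `toAsciiIconv` and the default locale `en_US.UTF-8` are assumptions; pick a UTF-8 locale that is actually installed on your system.

```php
<?php
/**
 * Hypothetical helper: ASCII transliteration via iconv.
 * Returns the converted string, or false if iconv could not convert it.
 */
function toAsciiIconv(string $text, string $locale = 'en_US.UTF-8')
{
    // Assumption: this locale is installed (`locale -a` lists candidates).
    // If it is not, setlocale() returns false and the current locale is kept.
    // Note that this changes the locale for the whole process.
    setlocale(LC_ALL, $locale);

    // //TRANSLIT asks for approximations; //IGNORE skips what cannot be converted.
    // The @ silences the notice that some implementations raise on failure.
    return @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
}

// The exact result varies between platforms, so inspect it on your own system.
var_dump(toAsciiIconv('ÈâuÑ'));
```

Because the result is platform-dependent, it is worth running this on the production server as well as on the development machine.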

Piskvor left the building
  • 11
    +1 Beat me to it. This should work best. However, note that this tends to fail if there are invalid characters in the input (using `ASCII//TRANSLIT//IGNORE` should help) and as so often, if encountering problems, the User Contributed Notes are a good read. http://www.php.net/manual/en/function.iconv.php – Pekka Aug 22 '10 at 18:28
  • 5
    For some reason, sometimes I can't get this to work. See http://codepad.viper-7.com/SUufA4 But on another machine, I got "`E^au~N". Not what was desired, though. – Artefacto Aug 22 '10 at 18:38
  • Nice, simple and small and works...for me – Mark Lalor Aug 22 '10 at 18:38
  • 1
    This iconv has some conflicts so I will ask a similar question – Mark Lalor Aug 22 '10 at 18:40
  • 6
    This did not work for me at first. Accent characters just became ? characters. As per a comment on iconv() on the PHP manual page, I first ran: setlocale(LC_ALL,'en_CA.utf8'); and then everything worked perfectly. The 'en_CA.utf8' was the default locale on my system. Try "locale -a" to see a list of available locales – Professor Falken Feb 23 '13 at 01:49
  • This iconv() solution works for many characters, but not all. For example, "Colbjørnsensgade" becomes "Colbj?rnsensgade". That's why the transliterator_transliterate() solution by SimonSimCity is usually a better choice (but requires the right libraries installed to work). – orrd Jun 06 '15 at 05:47
  • This doesn't work for all Russian characters – Sam Ivichuk Jul 31 '15 at 09:34
  • 7
    This fixed the question marks for me. `setlocale(LC_ALL, "en_US.utf8"); $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);` – Josh Bernfeld Jun 25 '16 at 23:18
  • Just another upvote for these comments here. I spent a few hours today trying to debug why my ASCII//TRANSLIT//IGNORE code wasn't working on a German a umlaut. My development platform worked fine. The live server failed. After trying a thousand things, the setlocale worked fine once added to both. – Scott Jun 06 '17 at 14:06
19

Reposting this on request of @palantir ...

I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...

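// Note: utf8_decode() converts to ISO-8859-1 (and is deprecated as of PHP 8.2),
// so only characters that exist in that encoding can be mapped here.
function toASCII( $str )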
{
    return strtr(utf8_decode($str), 
        utf8_decode(
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
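
A different technique, shown here as a sketch and not as the author's method, is the array form of `strtr()`. It replaces whole UTF-8 substrings instead of single bytes, so it needs no `utf8_decode()` and can map one character to several letters. The helper name `replaceWithTable` and the deliberately tiny table are illustrative assumptions.

```php
<?php
/**
 * Hypothetical helper: replace selected UTF-8 characters using strtr()'s
 * array form. The table is a small illustrative sample, not a complete list.
 */
function replaceWithTable(string $text): string
{
    $table = [
        'Æ' => 'AE', 'æ' => 'ae',
        'Œ' => 'OE', 'œ' => 'oe',
        'Ø' => 'O',  'ø' => 'o',
        'Ł' => 'L',  'ł' => 'l',
        'ß' => 'ss',
    ];

    // Keys are matched as byte sequences, so multi-byte characters are safe
    // as long as this file and the input are both encoded as UTF-8.
    return strtr($text, $table);
}

echo replaceWithTable('Colbjørnsensgade'), "\n"; // Colbjornsensgade
```

Characters that are not in the table pass through unchanged, so a table like this works best as a supplement to one of the library-based approaches.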
Eric Leschinski
designosis
  • 1
    you should also put in the following letters: `ő`, `Ő`, `ű`, `Ű`. Thanks. :) – Sk8erPeter Nov 29 '12 at 11:21
  • 15
    This is not a reliable method. It does not work for Polish accented chars like `ŻŹĆŃĄŚŁĘÓżźćńąśłęó`. Try `var_dump(strtr(utf8_decode('qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq'), utf8_decode('ŻŹĆŃĄŚŁĘÓżźćńąśłęó'),'ZZCNASLEOzzcnasleo'));` I got `string(25) "qqqqeeeeeeeeOeeeeeeeeoqqq"`. Iconv is more reliable: `var_dump(iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq'));` and I get `string(25) "qqqqZZCNASLEOzzcnasleoqqq"` – piotrekkr May 18 '13 at 21:22
  • 2
    converts 'Горловка' for me to YYYYYYYY , not good – Tebe Oct 28 '16 at 07:24
  • It's not the best in terms of performance and it also produces incorrect results. Letters like Œ, Æ, etc. should decompose to two letters, not to one. – Hasnat Safder May 22 '19 at 08:35
13

You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:

preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))

Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:

preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
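
A sketch of the second variant as a reusable function, assuming the intl extension is available. The function name `stripCombiningMarks` is hypothetical. It uses canonical decomposition (`FORM_D`) instead of `FORM_KD`, which leaves compatibility characters such as ligatures unchanged. Letters that have no decomposition at all (for example `Ł` or `ø`) pass through untouched in either form.

```php
<?php
/**
 * Hypothetical helper: decompose accented letters into base letter plus
 * combining mark, then delete the combining marks.
 */
function stripCombiningMarks(string $text): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required.');
    }

    // Normalizer::normalize() returns false when the input cannot be normalized.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input could not be normalized.');
    }

    // \p{Mn} matches non-spacing marks; the /u flag enables UTF-8 matching.
    return (string) preg_replace('/\p{Mn}/u', '', $decomposed);
}

echo stripCombiningMarks('ÈâuÑ'), "\n"; // EauN
```

Unlike the `iconv` variant, this does not depend on the locale, but the result may still contain non-ASCII letters.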
Gumbo
  • `ISO-8859-1`? Are you sure? Won't this leave at least ÄÖÜ in place (as their 8859-1 counterparts)? – Pekka Aug 22 '10 at 18:32
  • 1
    What’s the reason for the down vote? – Gumbo Aug 22 '10 at 18:32
  • 1
    Downvote isn't mine. However, the OP is not asking to remove non-alphabetic characters, is he? – Pekka Aug 22 '10 at 18:34
  • It was mine. Reverted now that you fixed it. – Artefacto Aug 22 '10 at 18:35
  • 2
    @Pekka: The transliteration of `ÈâuÑ` using `iconv` gives "`E^au~N". That’s why the following cleanup is used. – Gumbo Aug 22 '10 at 18:39
  • @Gumbo I see. I'm sorry, we have had this discussion in a duplicate somewhere already :) +1 for the most complete solution, then, that should be made the accepted one. *Update:* If I had any votes left – Pekka Aug 22 '10 at 18:40
  • By the way, what you say and your code don't match once again. FORM_D makes more sense. – Artefacto Aug 22 '10 at 18:47
  • @Artefacto: Thanks for the remark; fixed it. And take a look at figure 6 in http://unicode.org/reports/tr15/#Norm_Forms. – Gumbo Aug 22 '10 at 18:52
  • @Gumbo OK, I guess it's a matter of preference, though strictly speaking that normalization won't take care of only the marks. See also the other question of the OP. I took some, erm, inspiration from you (basically I only replaced the [a-z] regex you then had with \p{M} and left Normalizer::FORM_D). – Artefacto Aug 22 '10 at 19:09
  • The normalize function works for me. – Peter Stuifzand Dec 17 '12 at 15:07
  • @Gumbo your `Normalizer` solution is quite good but in my case two characters `Ł` and `ł` are left untouched. My code: `var_dump(preg_replace('/\p{Mn}/u', '',Normalizer::normalize('qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq', Normalizer::FORM_KD)));` and I get back: `string(27) "qqqqZZCNASŁEOzzcnasłeoqqq"`. `iconv` works best for me. – piotrekkr May 18 '13 at 21:34
12

Note: I'm reposting this from another similar question in the hope that it's helpful to others.

I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:

https://github.com/jbroadway/urlify

Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.

Johnny Broadway