
I've been reading a few other questions, but I am still stuck on converting strings that contain accented characters into plain characters (by which I mean a-z).

I have a product name "Áhkká" which is already encoded as "&Aacute;hkk&aacute;".

I want to decode this to the string with accents, and then convert it to read "Ahkka"

So far, I have tried:

function convert($name) {
   $name = html_entity_decode($name,ENT_COMPAT,"UTF-8");
   $name = iconv('UTF-8', 'ASCII//TRANSLIT', $name);
   return $name;
}

I get an error from iconv: "Detected an illegal character in input string"

I have also tried using htmlspecialchars_decode($name); but that gives me �hkk�

I also found a string replace function to clear accents, but I can't seem to pass a non-html string to it

$name = strtr($name,'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ','aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');

Can someone please offer a solution? The server is running PHP 5.2.13. iconv is enabled (glibc 2.5), and phpinfo shows ISO-8859-1 as the iconv input, internal and output encoding.

Alexander Holsgrove

1 Answer


Trying to find a solution to your problem I have found this question:

multibyte strtr() -> mb_strtr()

In the chosen answer Alix Axel writes a function which is exactly what you need:

function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8'));
}

echo Unaccent(html_entity_decode('&Aacute;hkk&aacute;', ENT_COMPAT, 'UTF-8'));

prints Ahkka
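
Since the product name is already stored with named entities, a possibly simpler route is to apply the same pattern to the raw string and skip the decode/encode round trip. This is only a sketch: `unaccent_entities` is a hypothetical name, and it assumes every accented character in the input appears as a named entity such as `&Aacute;`. Raw UTF-8 characters and unrelated entities such as `&amp;` pass through untouched.

```php
<?php
// Hypothetical helper (not from the linked answer): strips the accent
// suffix from named entities that are already present in the string.
function unaccent_entities($string)
{
    // "&Aacute;" -> "A", "&ccedil;" -> "c", "&AElig;" -> "AE"
    return preg_replace(
        '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i',
        '$1',
        $string
    );
}

echo unaccent_entities('&Aacute;hkk&aacute;'); // Ahkka
```

Because this works on plain ASCII entity text, it does not depend on the server's default charset.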

Michal Trojanowski
  • This looks promising, but I end up with "ampaacutehkkampaacute" as the output. I isolated my code and that was the output. I've updated my question with the test code based on your answer. Perhaps I need to set my default charset? – Alexander Holsgrove Oct 02 '12 at 22:13
  • Have you tried decoding html entities with the parameters you used before? `html_entity_decode($name,ENT_COMPAT,"UTF-8");` Do you get a proper UTF-8 string after decoding? If you do the `Unaccent` function should work well. – Michal Trojanowski Oct 03 '12 at 07:20
  • Sorry for my mistake. Your code did indeed work, but I also need to remove ' type entries. Secondly, this code will be run around 400,000 times, along with 2 more preg_replace calls, to make "clean" product URLs. Is this the fastest way? – Alexander Holsgrove Oct 03 '12 at 08:24
  • I've accepted your answer as it is a solution, although in the end I did use iconv (I needed to call setlocale). – Alexander Holsgrove Oct 03 '12 at 08:53
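
For reference, the iconv route mentioned in the last comment might look roughly like the sketch below. The exact call that was used was not posted, so the locale name `en_US.UTF-8` is an assumption (it has to be installed on the server) and `convert_iconv` is a hypothetical name. With glibc, `//TRANSLIT` depends on `LC_CTYPE`, and in the default "C" locale accented letters typically come out as `?`.

```php
<?php
// Assumption: a UTF-8 locale named en_US.UTF-8 is installed on the server.
// Set it once, not per call; setlocale() returns false if it is unavailable.
setlocale(LC_CTYPE, 'en_US.UTF-8');

// Hypothetical helper based on the convert() function from the question.
function convert_iconv($name)
{
    // ENT_QUOTES also decodes &#039; into a plain apostrophe.
    $decoded = html_entity_decode($name, ENT_QUOTES, 'UTF-8');

    // Ask iconv to approximate non-ASCII characters with ASCII ones.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $decoded);

    // iconv() returns false on failure; fall back to the decoded string.
    return $ascii === false ? $decoded : $ascii;
}

echo convert_iconv('&Aacute;hkk&aacute;');
```

The transliterated output varies between iconv implementations and locales, so it is worth checking a few product names on the target server before running it over the full set.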