So I'm building a website that is using a database feed that was already set up and has been used by the client for all their other websites for quite some time.

They fill this database through an external program, and I have no way to change the way I get my data.

Now I have the following problem: sometimes I get strings in UTF-8 and sometimes in ASCII (I hope I've got these terms right, they're still a bit vague to me sometimes).

So I could get either this: Sc&eacute;nic or Scénic.

Now the problem is that I have to convert this to non-special characters (so it would become Scenic) for use in URLs.

I don't think there's a function for converting é to e (if there is, do tell), so I'll probably need to create an array containing all the sources and destinations. The bigger problem is converting &eacute; to é without breaking é when it passes through that same function.

Or should I just create an array containing everything (for example: array('&eacute;'=>'e','é'=>'e'), etc.)?

I know how to get from &eacute; to é, by doing utf8_encode(html_entity_decode('&eacute;')); however, putting é through this same function returns Ã©.

Maybe I'm approaching this the wrong way, but in that case I'd love to know how I should approach it.

Kokos
  • html_entity_decode("&eacute;é",ENT_COMPAT,"UTF-8") works correctly for me: it outputs "éé". Maybe you just forgot to set the encoding? This should work on UTF-8 and on the first 128 ASCII characters (plain text), as they have the same values in UTF-8, which seems to be exactly your case. – XzKto Sep 23 '11 at 14:15
  • Ah, well that solves half the problem :) – Kokos Sep 23 '11 at 14:19
  • Does iconv("UTF-8","ASCII//TRANSLIT","ééé") solve the second part? – XzKto Sep 23 '11 at 14:27
  • `html_entity_decode('&eacute;eé',ENT_COMPAT,"UTF-8");` actually yields `�eé` for me (DOCTYPE html and meta charset UTF-8). On codepad.org it returns `éeé` though. If I then do `iconv("UTF-8","ASCII//TRANSLIT",$input);` on that I get an empty string. And `iconv` doesn't work on codepad so I can't test it there. – Kokos Sep 23 '11 at 14:36
  • I must be doing something wrong somewhere else. If I run this: http://ideone.com/QjoQk on my website I get this output: `string(4) "�eé" string(0) "" string(13) "ccc�aaadfgdfg" string(3) "ccc" string(7) "Citro�n" string(5) "Citro"` – Kokos Sep 23 '11 at 14:55
  • Hmm, it was something with the locale. When I set the locale to de_DE.UTF-8 according to this [post](http://www.php.net/manual/en/function.mb-detect-encoding.php#89915) it works. The problem is I set my locale for other functions, and these two mess each other up =/ – Kokos Sep 23 '11 at 15:00
  • Also do try `ßæıLJ`. That should become `ssaeiLJ`. – MSalters Nov 15 '11 at 13:52
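
XzKto's first suggestion in the comments above can be checked with a minimal sketch. The helper name `decode_entities_utf8` is hypothetical and only wraps the real `html_entity_decode()` call. The non-ASCII characters are written as `"\u{...}"` escapes (PHP 7+) so the result does not depend on how the source file itself is encoded.

```php
<?php
// Hypothetical helper name; html_entity_decode() is the only real API used.
function decode_entities_utf8(string $input): string
{
    // With the charset set to UTF-8, entities are decoded to UTF-8 bytes and
    // characters that are already UTF-8 are passed through untouched.
    return html_entity_decode($input, ENT_COMPAT, 'UTF-8');
}

// "\u{E9}" is é written as an escape sequence.
var_dump(decode_entities_utf8('Sc&eacute;nic') === "Sc\u{E9}nic"); // entity input
var_dump(decode_entities_utf8("Sc\u{E9}nic") === "Sc\u{E9}nic");   // UTF-8 input
```

Both comparisons should hold, which is why the same call can be applied to either form of the feed data without first detecting which one arrived.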

1 Answer

Thanks to @XzKto and this comment on PHP.net I changed my slug function to the following:

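static function slug($input){

    // Decode HTML entities (e.g. &eacute;) to UTF-8; text that is already UTF-8 is left as-is.
    $string = html_entity_decode($input, ENT_COMPAT, "UTF-8");

    // Passing '0' only queries the current LC_CTYPE locale so it can be restored later.
    $oldLocale = setlocale(LC_CTYPE, '0');

    // On my server //TRANSLIT only worked with a UTF-8 locale (see the comments on the question).
    setlocale(LC_CTYPE, 'en_US.UTF-8');
    $string = iconv("UTF-8", "ASCII//TRANSLIT", $string);

    // Put the previous locale back for the rest of the application.
    setlocale(LC_CTYPE, $oldLocale);

    // Replace each run of non-alphanumeric characters with one hyphen, then lowercase.
    return strtolower(preg_replace('/[^a-zA-Z0-9]+/', '-', $string));

}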

I feel like the setlocale part is a bit dirty but this works perfectly for translating special characters to their 'normal' equivalents.

Input `a áñö ïß éèé` returns `a-ano-iss-eee`.
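
If changing the locale is not an option, the lookup-array idea from the question can be sketched as follows. This is not the function above but a swapped-in alternative that replaces `iconv` with `strtr()` and a hand-written map. The name `slug_with_map` and the map contents are assumptions for illustration, and the map would need extending for real data.

```php
<?php
// Hypothetical alternative: an explicit map instead of iconv + setlocale.
function slug_with_map(string $input): string
{
    // Normalise entity input (e.g. &eacute;) to UTF-8 first.
    $string = html_entity_decode($input, ENT_COMPAT, 'UTF-8');

    // Illustrative map only; keys are written as escapes (PHP 7+).
    $map = [
        "\u{E1}" => 'a',  // á
        "\u{F1}" => 'n',  // ñ
        "\u{F6}" => 'o',  // ö
        "\u{EF}" => 'i',  // ï
        "\u{DF}" => 'ss', // ß
        "\u{E9}" => 'e',  // é
        "\u{E8}" => 'e',  // è
        "\u{EB}" => 'e',  // ë
    ];
    $string = strtr($string, $map);

    // Anything still outside a-z, A-Z, 0-9 (including unmapped
    // characters) is collapsed into a single hyphen.
    return strtolower(preg_replace('/[^a-zA-Z0-9]+/', '-', $string));
}

echo slug_with_map("a \u{E1}\u{F1}\u{F6} \u{EF}\u{DF} \u{E9}\u{E8}\u{E9}"), "\n"; // a-ano-iss-eee
echo slug_with_map('Sc&eacute;nic'), "\n"; // scenic
```

The trade-off is that the result no longer depends on the server locale, but any character missing from the map silently turns into a hyphen.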

Kokos