
I am putting some user-provided content in my URLs for SEO purposes, using this code to clean it up:

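/**
* Create URL-friendly strings or filenames.
* @param string $str       Input string, expected to be UTF-8
* @param array  $replace   Substrings to replace with a space before cleaning
* @param string $delimiter Separator placed between words in the result
* @return string
*/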
public static function toAscii($str, $replace=array(), $delimiter='-') {
  if(!empty($replace)) {
    $str = str_replace((array)$replace, ' ', $str);
  }
  $clean = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
  $clean = preg_replace("/[^a-zA-Z0-9\/_|+ -]/", '', $clean);
  $clean = strtolower(trim($clean, '-'));
  $clean = preg_replace("/[\/_|+ -]+/", $delimiter, $clean);
  return $clean;
}

However, I found out it is not enough. An article with some Hebrew characters gave me:

iconv(): Detected an illegal character in input string

Is there a silver-bullet function out there to safely make strings into pretty URLs? At the very least I would like it NOT to crash. Then, it'd be nice if the URL still looked nice and SEO-friendly.

Today it was Hebrew, but tomorrow it may be Russian, Chinese, Klingon...

Nathan H
  • I would just use percent encoding. See http://stackoverflow.com/questions/2742852/unicode-characters-in-urls Your main target should never be the search engines, but the users. When users copy and paste those URLs, they will see the decoded characters (at least in most cases). Search engines in general can also "understand" the percent-encoded string. – methode Dec 09 '12 at 14:49
  • Interesting idea. Even with classical characters to avoid such as quotes and stuff? – Nathan H Dec 09 '12 at 14:51
  • Yes, percent encoding would be your best bet in my opinion. – methode Dec 09 '12 at 14:52
  • urlencode would be safe enough then? – Nathan H Dec 09 '12 at 15:01
  • I would go with rawurlencode, mainly because Jonathan says so: http://stackoverflow.com/a/6998242/317491 . – methode Dec 09 '12 at 15:18
  • possible duplicate of [Remove accents without using iconv](http://stackoverflow.com/questions/3542818/remove-accents-without-using-iconv) – dynamic Dec 11 '12 at 11:52
  • `iconv` complaining most likely means your input is not actually UTF-8 encoded. – deceze Dec 11 '12 at 11:54
  • @llnk my post isn't about accents at all. Hebrew and Arabic for example won't really fit into this. – Nathan H Dec 11 '12 at 12:13
  • @methode One of my posts had a date like 2/12/12 as part of the URL, which was correctly converted to 2%2F12%2F12, but the server treated this URL as a 404. – Nathan H Dec 12 '12 at 06:10
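
The suggestions in the comments can be combined into a rough sketch: check that the input really is UTF-8, turn separators such as `/` into the delimiter before encoding (so no `%2F` ends up in the path), and percent-encode the rest with `rawurlencode`. The function name `toUrlSegment` is hypothetical, the sketch assumes the `mbstring` extension is available, and it assumes a plain delimiter such as `-`. It is one possible approach, not a drop-in replacement for `toAscii`.

```php
<?php
/**
 * Hypothetical helper: build a percent-encoded URL path segment
 * without transliterating to ASCII via iconv.
 */
function toUrlSegment(string $str, string $delimiter = '-'): string
{
    // Invalid UTF-8 is the likely cause of the iconv() notice.
    if (!mb_check_encoding($str, 'UTF-8')) {
        // Assumption: scrub invalid bytes (they become the mbstring
        // substitute character, '?' by default) instead of rejecting.
        $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    }

    // Collapse every run of characters that are not Unicode letters,
    // combining marks, or digits into the delimiter. This also removes
    // '/', so dates like 2/12/12 never produce %2F.
    $str = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', $delimiter, $str) ?? '';

    // Drop leading and trailing delimiters, then lowercase.
    $str = mb_strtolower(trim($str, $delimiter), 'UTF-8');

    // Percent-encode whatever non-ASCII text remains.
    return rawurlencode($str);
}

echo toUrlSegment('2/12/12'), "\n";
echo toUrlSegment('Hello, World!'), "\n";
echo rawurldecode(toUrlSegment('שלום עולם')), "\n";
```

Non-Latin text stays readable once the browser decodes it, and quotes, slashes, and other punctuation are removed before encoding rather than escaped.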
