1

I want to sanitize blog titles that contain Unicode characters so I can use them in URLs. I need to replace invalid characters and spaces with "-" to get SEO-friendly URLs like this:

        http://example.com/это-моя-хорошая

Does anyone have an idea how to do this?

uttam

1 Answer

3

You can use this algorithm for an SEO-friendly Unicode URL:

  1. Convert the text to Unicode Normalization Form C, i.e. precomposed characters.
  2. Use a regular expression with Unicode character classes to replace each character that is neither a letter nor a digit with a space.
  3. Remove leading and trailing spaces, and collapse repeated spaces into one.
  4. Shorten the text if it is too long for a URL.
  5. Replace spaces with hyphens.
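
The steps above can be sketched as follows. This is only an illustration in Python, not code from the original answer; the function name `slugify` and the `max_length` default of 80 are assumptions made for the example. Python's built-in `re` module has no `\p{L}` classes, so the sketch approximates "non-letter non-digit" with `[\W_]`, which in Unicode mode matches anything that is not alphanumeric. Note that this also replaces combining marks that NFC cannot precompose, which may matter for some scripts.

```python
import re
import unicodedata


def slugify(title: str, max_length: int = 80) -> str:
    """Build a URL slug from a title (hypothetical helper for illustration)."""
    # 1. Normalize to NFC so letters use precomposed code points where possible.
    text = unicodedata.normalize("NFC", title)
    # 2. Replace every character that is not alphanumeric with a space.
    #    [\W_] is an approximation of "not a letter and not a digit".
    text = re.sub(r"[\W_]", " ", text)
    # 3. Drop leading/trailing spaces and collapse runs of spaces.
    text = " ".join(text.split())
    # 4. Shorten; max_length is an assumed limit, counted in characters.
    text = text[:max_length].rstrip()
    # 5. Replace the remaining single spaces with hyphens.
    return text.replace(" ", "-")


print(slugify("это моя хорошая"))
```

Letter case is left untouched here because the steps do not mention it; lowercasing could be added as a separate step if the URLs should be case-insensitive.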
Anthony Faull
  • Thanks for the algorithm. I started looking into normalization of Unicode characters and found this link: http://www.php.net/manual/en/normalizer.normalize.php . Is this the correct function to use, or is there another library or function that can normalize Unicode characters? – uttam Mar 05 '12 at 01:09
  • @uttam That's right. In PHP you can use Normalizer::normalize. – Anthony Faull Mar 05 '12 at 08:33