3

I'm building a site in PHP that will support content in multiple languages. One part of the site will have business listings. I have SEO-friendly URLs set up to view these listings, so, for example, a business listing called "A bar down the street" would have a URL like this:

/listing/a-bar-down-the-street

However, let's say there is an Arabic version of this listing. Then the name would look like this:

شريط أسفل الشارع

How would I turn that into the same URL format as the English version, but in the language it is currently in? When I ran the Arabic version through my function that turns a string into an SEO-friendly URL, it came back empty.

EDIT: To clarify further, all I'm looking for is a PHP function that turns any string into an SEO-friendly URL, no matter what language the site is in.

EDIT PART 2: Below is the function I'm using to rewrite the string to an SEO-friendly URL. Perhaps you can tell me what I need to add to make it language friendly?

    public function urlTitle($str,$separator = 'dash',$lowercase = TRUE)
    {

        if ($separator == 'dash')
        {

            $search     = '_';
            $replace    = '-';

        }else
        {

            $search     = '-';
            $replace    = '_';

        }

        $trans = array(
                        '&\#\d+?;'              => '',
                        '&\S+?;'                => '',
                        '\s+'                   => $replace,
                        '[^a-z0-9\-_]'          => '',
                        $replace.'+'            => $replace,
                        $replace.'$'            => $replace,
                        '^'.$replace            => $replace,
                        '\.+$'                  => ''
                        );

        $str = strip_tags($str);
        $str = preg_replace("#\/#ui",'-',$str);

        foreach ($trans AS $key => $val)
        {

            $str = preg_replace("#".$key."#ui", $val, $str);

        }

        if($lowercase === TRUE)
        {

            $str = mb_strtolower($str);

        }

        return trim(stripslashes($str));

    }
John
  • http://www.stackoverflow.com/questions/9511254/how-to-create-unicode-slug-for-unicode-title this link might help you – uttam May 14 '12 at 16:19
  • @uttam Unfortunately I don't have the normalizer installed on the server and don't think I could get it installed. – John May 14 '12 at 16:34

4 Answers

1

I found a similar question in an existing SO discussion. It seems that what you are asking for should be possible "out of the box".

I would recommend looking into your web server config to see what the problem is. There should be no difference between SEO-friendly English URLs and any other URL-encodable string.

What webserver are you running?

UPDATE: I see that you are only accepting ASCII alphanumeric characters:

'[^a-z0-9\-_]'          => '',

I suspect this rule is the cause of the empty return: it removes every character outside a-z, 0-9, the hyphen and the underscore, so a title written entirely in Arabic letters is stripped completely. Alternatively, you can debug your function to see which of the replace rules wipes out your content.
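The effect of that rule can be reproduced in isolation. This is a minimal sketch, assuming the source file and the string are UTF-8 and that PCRE was built with Unicode support; the variable names are only illustrative and the rest of urlTitle() is left out.

```php
<?php
// Title from the question, assumed to be a UTF-8 string.
$title = 'شريط أسفل الشارع';

// Same rule as in urlTitle(): drop everything except a-z, 0-9, "-" and "_".
// The "i" modifier also lets A-Z through, but no Arabic letter matches.
$filtered = preg_replace('#[^a-z0-9\-_]#ui', '', $title);

var_dump($filtered); // string(0) ""

// The English slug survives because all of its characters are allowed.
$english = preg_replace('#[^a-z0-9\-_]#ui', '', 'a-bar-down-the-street');

var_dump($english); // string(21) "a-bar-down-the-street"
```

Running the rules one at a time like this is also a quick way to find out which replacement empties the string.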

What you are running into is that URLs cannot contain arbitrary characters by default. Browsers generally rely on encoding to display nice-looking multilingual URLs.

See example from link:

URLs are allowed only a certain set of english letter characters, which includes the numbers, dashes, slashes, and the question mark. All other characters have to be encoded, which applies to non-Latin domain names. If you go to فنادق.com, you will notice that some browsers will decode it and show you فنادق.com but some like Chrome will show you something like this http://www.xn--mgbq6cgr.com/.

This means that you can no longer filter your post title down to URL-valid ASCII characters. You need to percent-encode the titles and hope that the browser renders them the way you would like.
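As a rough illustration of the encoding step, the sketch below uses PHP's built-in rawurlencode() and rawurldecode(). It assumes the slug is already a UTF-8 string with dashes in place; the /listing/ prefix is taken from the question and the variable names are hypothetical.

```php
<?php
// Slug assumed to be UTF-8 and already separated by dashes.
$slug = 'شريط-أسفل-الشارع';

// rawurlencode() percent-encodes every byte outside the unreserved set,
// so the resulting path is plain ASCII on the wire.
$path = '/listing/' . rawurlencode($slug);

echo $path, "\n";

// Decoding restores the original UTF-8 slug, e.g. when looking up the listing.
$decoded = rawurldecode(substr($path, strlen('/listing/')));

var_dump($decoded === $slug); // bool(true)
```

Whether the address bar shows the Arabic text or the percent-encoded form is up to the browser; the server sees the same bytes either way.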

Another option would be to use transliteration, possibly only after detecting a browser that is known not to render URL-encoded special characters.

petr
  • I don't think my web server or mod_rewrite is the problem. I believe it's the PHP function I'm using to turn the string into an SEO-friendly URL. If I manually copy the Arabic text and put it into the URL, it works just fine. I just need to know how to turn a string into an SEO-friendly URL in PHP, no matter what language it is in. Also, I'm running Apache. – John May 14 '12 at 16:11
  • then I think we need more information about the php function you are using as it probably has problems with other character encodings – petr May 14 '12 at 16:13
  • Ok just posted my php function. – John May 14 '12 at 16:17
  • So how would I encode the string? Right now if I just remove the '[^a-z0-9\-_]' => '' part of the script it now looks like this: /listing/��������-��������-������������ – John May 14 '12 at 16:44
  • OK, I think I found why it's giving me the question marks. It has nothing to do with the preg_replace part; it's the lowercase part. The mb_strtolower call is what produces those characters. Once I remove that one line, it works just fine. Does converting to lowercase matter for a different language? – John May 14 '12 at 17:01
  • Another interesting thing I found out: mb_strtolower only works if I specify the encoding as UTF-8. I thought anything mb would handle UTF-8 by default. So using this worked: $str = mb_strtolower($str,'UTF-8'); – John May 14 '12 at 17:03
  • So after removing the line that checks for English characters and adding the UTF-8 argument to the mb_strtolower call, the SEO-friendly URL looks correct now. When I ran it on the Arabic text from my original post, it now has the dashes between the words. – John May 14 '12 at 17:20
0

So what seems to work for me is taking this part out of my PHP function:

'[^a-z0-9\-_]'          => '',

And updating the mb_strtolower line to:

$str = mb_strtolower($str,'UTF-8');

With those two changes it seems to work as normal. However, can anyone confirm this will keep working going forward? Will browsers understand this for all languages, or do I have to normalize the string to make sure every browser can understand the URL? The problem is that I'm not on PHP 5.3, which is required to install the normalization extension for PHP. I'm currently on 5.2.x, and I'm afraid upgrading will break things.
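On the lowercase step: when the encoding argument is omitted, mb_strtolower() falls back to whatever mb_internal_encoding() returns, and that is not necessarily UTF-8 on an older installation. A small sketch, assuming UTF-8 input and the mbstring extension; the sample strings are only illustrative.

```php
<?php
// Passing the encoding explicitly avoids depending on server configuration.
$latin  = mb_strtolower('ÉCOLE À PARIS', 'UTF-8');     // "école à paris"
$arabic = mb_strtolower('شريط-أسفل-الشارع', 'UTF-8');  // unchanged: Arabic has no letter case

// Alternative: set the internal encoding once, early in the request,
// so later mb_* calls without an encoding argument use UTF-8.
mb_internal_encoding('UTF-8');
$same = mb_strtolower('ÉCOLE À PARIS');

var_dump($latin === $same); // bool(true)
```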

John
  • Regarding whether it will keep working going forward: please have a look at my post, where I describe how browsers handle this. The only way to be 100% compatible is to use transliteration, or to detect the browser and redirect. See last two – petr May 14 '12 at 19:29
  • @petr I found this, https://github.com/jbroadway/urlify. It came from this post, http://stackoverflow.com/questions/1284535/php-transliteration – John May 14 '12 at 20:25
0

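John, you're right: the main problem is that your regex character class ([^a-z0-9\-_]) only allows ASCII letters and digits, so every non-Latin letter is stripped. \p{L} matches a letter in any script when the pattern uses the u modifier, which your function already does. This should work better: [^\p{L}0-9\-_]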

I had been working on a function like this recently and just published a blog post that includes the function I came up with: Creating SEO Friendly URLs in PHP with url_slug()
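For illustration only, here is one way the suggested character class could be wired into a slug function. This is a sketch under stated assumptions, not the url_slug() function from the blog post: unicodeUrlTitle() is a hypothetical name, the input is assumed to be UTF-8, and \p{M} (combining marks) and \p{N} (digits in any script) are additions beyond the class given above.

```php
<?php
/**
 * Hypothetical Unicode-aware variant of the urlTitle() function from the question.
 * Assumes UTF-8 input, the mbstring extension, and a "-" or "_" separator.
 */
function unicodeUrlTitle($str, $separator = '-', $lowercase = true)
{
    $str = strip_tags($str);

    // Drop HTML entities such as &amp; or &#160;.
    $str = preg_replace('#&\#?\w+;#u', '', $str);

    // Turn runs of whitespace and slashes into a single separator.
    $str = preg_replace('#[\s/]+#u', $separator, $str);

    // Keep letters, combining marks and digits from any script, plus "-" and "_".
    $str = preg_replace('#[^\p{L}\p{M}\p{N}\-_]#u', '', $str);

    // Collapse repeated separators and trim them from both ends.
    $quoted = preg_quote($separator, '#');
    $str = preg_replace('#(?:' . $quoted . ')+#u', $separator, $str);
    $str = trim($str, $separator);

    if ($lowercase) {
        // Explicit encoding, so the result does not depend on mb_internal_encoding().
        $str = mb_strtolower($str, 'UTF-8');
    }

    return $str;
}

echo unicodeUrlTitle('A bar down the street'), "\n"; // a-bar-down-the-street
echo unicodeUrlTitle('شريط أسفل الشارع'), "\n";      // شريط-أسفل-الشارع
```

The result is still raw UTF-8, so it would need to be percent-encoded wherever a strictly ASCII URL is required.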

0

I have a site that supports 48 different languages. The function I use to clean the URLs is below (in JavaScript); perhaps it is helpful for you:

const noHyphenLangs = ['ko', 'ja', 'zh-cn', 'zh-tw', 'ar', 'th']
const formatTranslationIntoPath = (text, symbol) => { // utf-8 encoding
  let t = text
  const replaceChar = noHyphenLangs.includes(symbol) ? '' : '-'
  t = t.replace(/-/g, ' ')
  t = t.replace(/\s/g, replaceChar)
  t = t.replace(/['`’]/g, '') // remove quotes
  t = t.replace(/[,,()]/g, '') // remove junk
  t = t.normalize('NFD').replace(/\p{Diacritic}/gu, '') // simplify letters for url https://stackoverflow.com/questions/990904/remove-accents-diacritics-in-a-string-in-javascript
  t = t.replace(/[Łł]/g, 'l') // doesn't get replaced in diacritic replacements

  return t.toLowerCase()
}

const ex1 = formatTranslationIntoPath('让我们  尝试-这样-做', 'zh-cn') // 让我们尝试这样做
const ex2 = formatTranslationIntoPath('Việt miễn phí', 'vi') // viet-mien-phi

PS: For most languages, you don't want to remove the non-alphanumeric characters if there are no diacritic replacements available.

Ref: https://gist.github.com/KevinDanikowski/24c79cbb7a3ef2a7f3e452e740848249

Kevin Danikowski