9

Imagine a page title string in any given language (English, Arabic, Japanese, etc.) containing several words in UTF-8. Example:

$stringRAW = "Blues & μπλουζ Bliss's ブルース Schön";

Now this needs to be converted into something that is a valid portion of that page's URL:

$stringURL = "blues-μπλουζ-bliss-ブルース-schön";

Just check out this link. This works on my server too!

Q1. What characters are allowed in a valid URL these days? I remember having seen whole Arabic strings sitting in the browser's address bar, and I tested it on my Apache 2 and it all worked fine.

I guess it must become: $stringURL = "blues-blows-bliss-black"

Q2. What existing PHP functions do you know of that encode/convert these UTF-8 strings correctly for a URL, stripping them of any invalid characters?

I guess that at least:
1. spaces should be converted into dashes (-)
2. invalid characters should be deleted (which are they? @ and &?)
3. all letters should be converted to lower case (or are capital letters valid in URLs?)

Thanks: your suggestions are much appreciated!

Sam
  • 1
    Strongly related: http://stackoverflow.com/questions/465990/how-to-handle-diacritics-accents-when-rewriting-pretty-urls – Pekka Mar 07 '11 at 16:28
  • 1
    `(ripping it of any invalid things like ' or & or spaces)` -- These aren't technically invalid. They just must be encoded via `urlencode` – Kevin Peno Mar 07 '11 at 16:51
  • Awesome link, Pekka! +1 That's one quality link, especially on the matter of foreign characters – Sam Mar 07 '11 at 16:55
  • 1
    Kevin, that changes everything then, doesn't it? To my surprise, something like `/Café` is allowed too. I think I am on the verge of awakening from the Middle Ages... and acknowledging that more is allowed than I thought (since the Renaissance...) – Sam Mar 07 '11 at 16:56

5 Answers

11

This is the solution I use:

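$text = 'Nevalidní Český text';
$text = preg_replace('/[^\\pL0-9]+/u', '-', $text); // replace runs of non-letters/non-digits with a dash
$text = trim($text, "-"); // strip leading and trailing dashes
$text = iconv("utf-8", "us-ascii//TRANSLIT", $text); // transliterate to ASCII (result depends on the iconv implementation and locale)
$text = preg_replace('/[^-a-z0-9]+/i', '', $text); // drop anything the transliteration left behind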

Capitals in URLs are not a problem, but if you want the text to be lowercase, simply add $text = strtolower($text); at the end :-).
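
For reuse, the steps above could be wrapped in a function. This is only a minimal sketch, not a definitive implementation: the name slugify_ascii is hypothetical, and the fallback for a failed iconv() call is an assumption added here, because //TRANSLIT behaves differently across iconv implementations and locales (and is rejected by some).

```php
<?php
// Hypothetical helper: ASCII-only slug built from the steps above.
function slugify_ascii(string $text): string
{
    // Replace every run of non-letters/non-digits with a single dash.
    $text = preg_replace('/[^\pL0-9]+/u', '-', $text) ?? '';
    $text = trim($text, '-');

    // Transliterate to ASCII. Assumption: if iconv() refuses //TRANSLIT,
    // keep the text as-is and let the next step strip the non-ASCII bytes.
    $ascii = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
    if ($ascii !== false) {
        $text = $ascii;
    }

    // Drop whatever is still not a letter, digit or dash, then lowercase.
    $text = preg_replace('/[^-a-z0-9]+/i', '', $text) ?? '';
    return strtolower($text);
}

echo slugify_ascii('Hello, World!'), "\n"; // hello-world
```

For non-ASCII input such as 'Nevalidní Český text' the exact result depends on how the local iconv transliterates, so it is worth checking on the target server.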

grongor
7

I would use:

$stringURL = str_replace(' ', '-', $stringRAW); // Converts spaces to dashes
$stringURL = urlencode($stringURL); // Percent-encodes everything else that is not URL-safe
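
To make the difference concrete, here is a small sketch comparing urlencode() and rawurlencode() on a sample title. The title 'Café au lait' is only an illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
$title = 'Café au lait';

// urlencode() uses form-style encoding: spaces become '+'.
echo urlencode($title), "\n";    // Caf%C3%A9+au+lait

// rawurlencode() follows RFC 3986: spaces become '%20'.
echo rawurlencode($title), "\n"; // Caf%C3%A9%20au%20lait

// With spaces already turned into dashes, only the non-ASCII bytes get encoded.
$slug = str_replace(' ', '-', $title);
echo urlencode($slug), "\n";     // Caf%C3%A9-au-lait

// Decoding restores the readable form.
echo urldecode(urlencode($slug)), "\n"; // Café-au-lait
```

For a path segment, rawurlencode() is usually the closer fit, since '+' only means a space in query strings.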
Josh
  • 1
    Because urlencode replaces spaces with '+'. Whereas, he requested spaces to be replaced with dashes. – Josh Mar 07 '11 at 16:51
2
$stringURL = preg_replace('~[^a-z-]~', '', str_replace(' ', '-', strtolower($stringRAW))); // keeps only a-z and dashes; non-ASCII letters are dropped

Check this method: http://www.whatstyle.net/articles/52/generate_unique_slugs_in_cakephp

powtac
2

Take the title of your web page, e.g. $title = "mytitle#$3%#$5345";, and simply urlencode it:

$url = urlencode($title);

You don't need to worry about the small details, but remember that you still have to identify the request from the URL. It's best to use a unique ID prefix in the URL, such as /389894/sdojfsodjf; during routing you can use the ID 389894 to look up the topic sdojfsodjf.
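
A rough sketch of that routing step, assuming paths of the form /<id>/<slug>. The function name parse_topic_path and the returned array shape are hypothetical; a real application would plug this into its own router or rewrite rules.

```php
<?php
// Hypothetical helper: split "/389894/some-slug" into its numeric ID and slug.
// Only the ID is needed to find the topic; the slug is there for readers.
function parse_topic_path(string $path): ?array
{
    // One numeric segment, an optional slug segment, an optional trailing slash.
    if (preg_match('#^/(\d+)(?:/([^/]*))?/?$#', $path, $m) !== 1) {
        return null; // not a topic URL
    }

    return [
        'id'   => (int) $m[1],
        'slug' => rawurldecode($m[2] ?? ''),
    ];
}

var_dump(parse_topic_path('/389894/sdojfsodjf'));
var_dump(parse_topic_path('/about')); // NULL
```

Because the lookup uses only the ID, a title can change later without breaking old links.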

Mr Coder
  • 1
    Since most browsers nowadays show the unencoded URL (unless you paste an encoded one in some cases), I tend to prefer this option as well. – Kevin Peno Mar 07 '11 at 16:46
  • Indeed, I have an idea like a language segment `/en/` and a file segment `/tomato/`, which resolves to tomato.php in English. Then I would like to add the title, so `/en/tomato/whatever-blabla`. Is this okay? Any links that can help me set up this last added portion via htaccess/Apache? – Sam Mar 07 '11 at 19:20
  • @Kevin, what do you mean by "show the unencoded url"? Do you mean they are all compatible with the unencoded URL anyway, or do you mean they show the code wrongly and we always *should* use urlencode()? – Sam Mar 07 '11 at 19:21
  • @Sam, what I mean is that in the past, browsers such as Firefox used to show the URL as is (meaning %20, %35, etc. encoding for special URL characters all appeared in the browser's URL bar). Nowadays, I cannot think of a time I've given a browser a urlencoded URL and it has not translated that into human-readable text for me, so it appears cleanly in the browser's URL box. Regardless, you must always urlencode URLs. – Kevin Peno Mar 07 '11 at 19:35
  • Roger that! Thanks for clearing that up. Indeed, even strange Arabic / Japanese text just appears as readable in the Firefox URL bar... (albeit unreadable to my own limited linguistic knowledge :) – Sam Mar 07 '11 at 20:27
1

Here is a short & handy one that does the trick for me

$title = trim(strtolower($title));  // lower string, removes white spaces and linebreaks at the start/end
$title = preg_replace('#[^a-z0-9\s-]#', '', $title); // remove all unwanted chars (empty string, since a null replacement is deprecated in PHP 8.1+)
$title = preg_replace('#[\s-]+#','-', $title); // replace white spaces and - with - (otherwise you end up with ---)

And of course you need to handle umlauts, currency signs and so forth, depending on the possible input.
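
One possible way to deal with umlauts and other non-ASCII letters is to keep them instead of stripping them. The sketch below is an assumption-laden variant, not a definitive implementation: the name slugify_unicode is hypothetical, it needs the mbstring extension, and dropping a possessive 's is a choice made here only so that "Bliss's" becomes "bliss" as in the question.

```php
<?php
// Hypothetical helper: slug that keeps Unicode letters and digits.
function slugify_unicode(string $title): string
{
    // Multibyte-aware lowercasing (requires ext-mbstring).
    $slug = mb_strtolower(trim($title), 'UTF-8');

    // Assumption: drop a possessive 's so "bliss's" becomes "bliss".
    $slug = preg_replace("/['’]s\\b/u", '', $slug) ?? '';

    // Collapse every run of non-letters/non-digits into a single dash.
    $slug = preg_replace('/[^\p{L}\p{N}]+/u', '-', $slug) ?? '';

    return trim($slug, '-');
}

echo slugify_unicode("Blues & μπλουζ Bliss's ブルース Schön"), "\n";
// blues-μπλουζ-bliss-ブルース-schön
```

When the slug goes into a link, it should still be passed through rawurlencode(); browsers typically display the decoded form in the address bar.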

Hannes