6

We need to generate a unique URL from the title of a book, where the title can contain any character. How can we search and replace all the 'invalid' characters so that a valid and neat-looking URL is generated?

For instance:

"The Great Book of PHP"

www.mysite.com/book/12345/the-great-book-of-php

"The Greatest !@#$ Book of PHP"

www.mysite.com/book/12345/the-greatest-book-of-php

"Funny title     "

www.mysite.com/book/12345/funny-title
codaddict
siliconpi
  • Take a look at this question: http://stackoverflow.com/questions/2668854/php-sanitizing-strings-to-make-them-url-and-filename-safe – fabrik Oct 21 '10 at 06:57
  • As there is some confusion: What do you mean by valid character? – Gumbo Oct 21 '10 at 07:34
  • possible duplicate of [Regular Expression Sanitize (PHP)](http://stackoverflow.com/questions/3022185/regular-expression-sanitize-php) – SilentGhost Oct 21 '10 at 13:41

8 Answers

17

Ah, slugification

// This function expects the input to be UTF-8 encoded.
function slugify($text)
{
    // Replace each run of characters that are neither letters nor digits with a -
    $text = preg_replace('/[^\\pL\d]+/u', '-', $text); 

    // Trim out extra -'s
    $text = trim($text, '-');

    // Convert letters that we have left to the closest ASCII representation
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);

    // Make text lowercase
    $text = strtolower($text);

    // Strip out anything we haven't been able to convert
    $text = preg_replace('/[^-\w]+/', '', $text);

    return $text;
}

This works fairly well: it first uses the Unicode properties of each character to determine whether it is a letter (or \d for a digit) and converts every run of other characters to a single -. It then trims the leading and trailing -'s, transliterates to ASCII, lowercases the result, and strips anything that could not be converted. (Fabrik's test returns "arvizturo-tukorfurogep")

I also tend to add a list of stop words that are removed from the slug: "the", "of", "or", "a", etc. (but don't filter on length, or you strip out stuff like "php").
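
The stop-word idea could be sketched roughly as follows. This is only an illustration, not part of the function above: the helper name `remove_stop_words` and its default word list are assumptions, and it expects a slug that has already been lowercased and hyphenated.

```php
<?php
// Hypothetical helper: drop stop words from an already-slugified string.
function remove_stop_words(string $slug, array $stopWords = ['the', 'of', 'or', 'a']): string
{
    // Split the slug into its hyphen-separated words.
    $words = explode('-', $slug);

    // Keep only non-empty words that are not in the stop list.
    $kept = array_filter($words, function ($word) use ($stopWords) {
        return $word !== '' && !in_array($word, $stopWords, true);
    });

    // If every word was a stop word, fall back to the original slug.
    return $kept ? implode('-', $kept) : $slug;
}

echo remove_stop_words('the-great-book-of-php'), "\n"; // great-book-php
```

Matching against an explicit list rather than a minimum word length is what keeps short but meaningful words such as "php" in the slug.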

Mez
  • Simple yet brilliant! +++ ;) (Now wondering what's that hocus-pocus inside WP source :o) – fabrik Oct 21 '10 at 14:07
  • The Unicode matching only works on 5.1+ and iconv might not be installed on some servers - they have to cater for everyone. – Mez Oct 21 '10 at 17:02
  • If I may suggest an edit, I've added `$text = utf8_encode($text);` at the first line. Without this conversion, a string such as `Mon titre français` returned blank, whereas now it becomes `mon-titre-francais`. – davewoodhall Oct 02 '13 at 14:29
  • @PubliDesign Then your internal encoding is not set to UTF-8. You can enforce this by using `mb_internal_encoding('UTF-8')` or setting [responsible INI values](http://stackoverflow.com/questions/12710842/php-internal-encoding). Your string is working out-of-the-box with @Mez's code. – althaus Oct 08 '13 at 10:42
  • @althaus, the original code doesn't force the string to be UTF-8, which may result in weird unwanted characters (ex: ? in a black triangle). Having tried this string with the added `$text = utf8_encode($text);`, I have had great results after several tests. – davewoodhall Oct 08 '13 at 14:28
7

If “invalid” means non-alphanumeric, you can do this:

function foo($str) {
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($str)), '-');
}

This will turn $str into lowercase, replace any sequence of one or more non-alphanumeric characters by one hyphen, and then remove leading and trailing hyphens.

var_dump(foo("The Great Book of PHP") === 'the-great-book-of-php');
var_dump(foo("The Greatest !@#$ Book of PHP") === 'the-greatest-book-of-php');
var_dump(foo("Funny title     ") === 'funny-title');
Gumbo
  • Fails too. Sorry. Please read the question: "the title can contain any character" – fabrik Oct 21 '10 at 07:07
  • @fabrik: So what’s wrong? Didn’t you test the examples? They all yield true. – Gumbo Oct 21 '10 at 07:08
  • @fabrik: “If ‘invalid’ means non-alphanumeric […]” – matt_tm didn’t say anything about what invalid means. I just assumed that he means non-alphanumeric. – Gumbo Oct 21 '10 at 07:29
  • @Gumbo: Thank you for at least trying to understand what I'm talking about. It's not only Hungarian characters: take a book about Citroën and there you go, accented characters in an international brand's name. Yes, the OP didn't specify what is invalid and what is not, but he stated "the title can contain **any** character". (And because we are talking about books, there's a chance of accented characters.) – fabrik Oct 21 '10 at 07:34
  • Hi - sorry to barge in on your conversation, and yes, non-English characters should be accounted for as well... It's not a strict requirement that the 'visible' title be exactly the same as the actual title, but it MUST be a valid URL... – siliconpi Oct 22 '10 at 13:47
2

You can use a simple regular expression for this purpose:

<?php
    function safeurl( $v )
    {
        $v = strtolower( $v );
        $v = preg_replace( "/[^a-z0-9]+/", "-", $v );
        $v = trim( $v, "-" );
        return $v;
    }
    echo "<br>www.mysite.com/book/12345/" . safeurl( "The Great Book of PHP" );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "The Greatest !@#$ Book of PHP" );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "  Funny title  " );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "!!Even Funnier title!!" );
?>
Salman A
1

If you want to allow only letters, digits and underscore (usual word characters) you can do:

$str = strtolower(preg_replace(array('/\W/','/-+/','/^-|-$/'),array('-','-',''),$str));

It first replaces every non-word character (\W) with a -.
Next it replaces any run of consecutive -'s with a single -.
Finally it deletes any leading or trailing -.
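
The three steps above can be written out one at a time, which makes each replacement easier to follow. This is a sketch for illustration only; the function name `slug_steps` is an assumption and not part of the answer's one-liner.

```php
<?php
// Hypothetical step-by-step version of the one-liner, for illustration only.
function slug_steps(string $str): string
{
    $str = preg_replace('/\W/', '-', $str);   // 1. every non-word character becomes a hyphen
    $str = preg_replace('/-+/', '-', $str);   // 2. runs of hyphens collapse into one
    $str = preg_replace('/^-|-$/', '', $str); // 3. a leading or trailing hyphen is removed
    return strtolower($str);
}

echo slug_steps('The Greatest !@#$ Book of PHP'), "\n"; // the-greatest-book-of-php
echo slug_steps('Funny title     '), "\n";              // funny-title
```

Because \W treats the underscore as a word character, underscores in the title survive into the slug.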

Working link

codaddict
  • Go ahead and downvote Gumbo too. I bet you're having a bad day. – Salman A Oct 21 '10 at 07:06
  • @Salman: Please understand it's not an easy preg_replace: http://core.trac.wordpress.org/browser/tags/3.0.1/wp-includes/formatting.php – fabrik Oct 21 '10 at 07:11
1

This code comes from CodeIgniter's url helper. It should do the trick.

function url_title($str, $separator = 'dash', $lowercase = FALSE)
    {
        if ($separator == 'dash')
        {
            $search     = '_';
            $replace    = '-';
        }
        else
        {
            $search     = '-';
            $replace    = '_';
        }

        $trans = array(
                        '&\#\d+?;'              => '',
                        '&\S+?;'                => '',
                        '\s+'                   => $replace,
                        '[^a-z0-9\-\._]'        => '',
                        $replace.'+'            => $replace,
                        $replace.'$'            => $replace,
                        '^'.$replace            => $replace,
                        '\.+$'                  => ''
                      );

        $str = strip_tags($str);

        foreach ($trans as $key => $val)
        {
            $str = preg_replace("#".$key."#i", $val, $str);
        }

        if ($lowercase === TRUE)
        {
            $str = strtolower($str);
        }

        return trim(stripslashes($str));
    }
thomaux
0

Replace special characters with white space, and then replace the white space with "-". Maybe with str_replace?

0

Use a regex replace to turn every run of non-letter characters into a hyphen. For example:

preg_replace('/[^a-zA-Z]+/', '-', $input)

Ward Bekker
0
<?php
$input = "  The Great Book's of PHP  ";
$output = trim(preg_replace(array("`'`", "`[^a-z]+`"),  array("", "-"), strtolower($input)), "-");
echo $output; // the-great-books-of-php

This trims leading and trailing dashes and doesn't turn "it's raining" into "it-s-raining", as most solutions tend to do.

mpen
  • @Gumbo: I find it preferable. Easier to read, no? Otherwise you read it like "it ess raining" and that's just weird. – mpen Oct 21 '10 at 17:16
  • “It’s” and “its” have a different meaning. The preferable variant would be to use its expanded (unambiguous) variant, so “it is” or “it has”. – Gumbo Oct 21 '10 at 17:44
  • @Gumbo: It's a URL. It's supposed to be short and concise.. if anything I'd strip out words like "is" and "has" too. No one is going to be looking for grammatical errors in a URL. And if they can't figure out "its-raining" actually means "it is raining" because there's no apostrophe....then... they need to go back to school. – mpen Oct 21 '10 at 19:09
  • @Mark: What about constructs with words that are ambiguous like `its-meaning`? – Gumbo Oct 21 '10 at 19:22
  • @Gumbo: When do you ever say "it is meaning"? And who cares? They can visit the website and read the actual title on the actual page in all its unicode glory. – mpen Oct 21 '10 at 19:55