acceptable url character replacement for apostrophes and hyphens

Question

I have a string like this: My First - Recipe's. I want to translate it into a readable URL.

I'd like to change the spaces to hyphens (-), but the string already contains a hyphen. It also contains an apostrophe.

I read online that underscores are bad in clean URLs, which is why I want to use hyphens, but I cannot figure out what to change the existing hyphen and the apostrophe to.

@Dai no that wont work as i need to translate it with the database record. — somejkuser, May 10 '17 at 18:52
If you'll be processing it on the backend, put an escape character of your choice in front of the hyphen. — freginold, May 10 '17 at 18:55
@jkushner That won't work then, because a "URI-safe" version of a string will contain less data than the original string data (see Information Theory) as it's a one-way function. You will need to store the URI-safe version in a separate database column if you want to perform a lookup by the URI-safe string. — Dai, May 10 '17 at 19:51

score 0 · Accepted Answer · edited May 23 '17 at 12:02

While you can use Unicode in web-page URLs, in practice and for usability you're restricted in the characters you can use. For example, if you want people to be able to type in the full address manually, then it's best to avoid all non-alphanumeric characters, especially punctuation, and if possible also avoid visual homographs such as L, l, 1, i, I, and so on.

So if you have a web-page for an article of writing whose internal title is "Kaajh'Káalbh!" then you'll want the URL-name to be kaajh-kaalbh.

Note that the conversion from "Kaajh'Káalbh!" to "kaajh-kaalbh" loses information: it is a "one-way function". Given a particular output ("kaajh-kaalbh") you cannot determine the original input, because multiple inputs lead to the same output. It could have been "Kaajh'Káalbh!", "Kaajh Kaalbh", "kaajh Kaalbh?", and so on.
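To make the collision concrete, here is a minimal sketch. `Slugify` is a hypothetical helper written only for this illustration (it is not a library function): it assumes that diacritics are stripped, letters are lower-cased, and every run of other characters becomes a single hyphen.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SlugDemo {

    // Hypothetical helper: lower-cases, strips diacritics, and turns every run of
    // non-letter/digit characters between two words into a single hyphen.
    public static String Slugify(String title) {
        if( title == null ) throw new ArgumentNullException(nameof(title));

        StringBuilder sb = new StringBuilder( title.Length );
        Boolean pendingHyphen = false;

        // FormD splits an accented letter into its base letter plus combining marks
        foreach(Char ch in title.Normalize( NormalizationForm.FormD )) {

            // Skip the combining marks (the diacritics)
            if( CharUnicodeInfo.GetUnicodeCategory( ch ) == UnicodeCategory.NonSpacingMark ) continue;

            if( Char.IsLetterOrDigit( ch ) ) {
                // Emit at most one hyphen, and never at the start of the output
                if( pendingHyphen && sb.Length > 0 ) sb.Append( '-' );
                pendingHyphen = false;
                sb.Append( Char.ToLowerInvariant( ch ) );
            }
            else {
                pendingHyphen = true; // spaces, apostrophes, punctuation, ...
            }
        }

        return sb.ToString();
    }

    public static void Main() {
        // All three inputs print the same output: kaajh-kaalbh
        foreach(String input in new[] { "Kaajh'Káalbh!", "Kaajh Kaalbh", "kaajh Kaalbh?" }) {
            Console.WriteLine( "{0} -> {1}", input, Slugify( input ) );
        }
    }
}
```

Because all three titles map to the same string, nothing in "kaajh-kaalbh" tells you which title it came from.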

You could argue that you could still query the database and find which rows correspond to the input, and I imagine your query would look like this:

SELECT * FROM Articles WHERE GetUrlVersionOfTitle( Title ) = 'kaajh-kaalbh'

Where GetUrlVersionOfTitle is a function in your SQL that would perform the conversion like so:

GetUrlVersionOfTitle( x ) = x.ToLower().Replace( ' ', '-' ).Replace( '\'', '-' ).Replace( etc )...

...which means your query becomes non-sargable: the database cannot use an index on Title, so it has to run the function on every row in your table, every time, which gives terrible query performance. It also doesn't solve the problem of ensuring that at most 1 row has a given URL-name (to guarantee that only 1 row matches a given URL-name input).

The solution then is to precompute the URL-name, store it in a separate column, and also have a UNIQUE constraint against it:

CREATE TABLE Articles (
    ArticleId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Title     nvarchar(255) NOT NULL,
    UrlTitle  varchar(255)  NOT NULL UNIQUE,
    ...
)

INSERT INTO Articles( Title, UrlTitle ) VALUES ( @title, @urlTitle )

(where @urlTitle is a parameter whose value is the precomputed URL-friendly version of Title).
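With a UNIQUE constraint in place, an INSERT fails when two titles produce the same URL-name, so you need some policy for picking a free one. One possible policy, sketched below, is to append a numeric suffix. `MakeUnique` and `isTaken` are hypothetical names for this illustration; `isTaken` stands in for a lookup against the UrlTitle column.

```csharp
using System;
using System.Collections.Generic;

public static class UrlTitleAllocator {

    // Hypothetical helper: returns baseName if it is free, otherwise
    // baseName-2, baseName-3, ... until isTaken reports a free name.
    public static String MakeUnique(String baseName, Func<String,Boolean> isTaken) {
        if( baseName == null ) throw new ArgumentNullException(nameof(baseName));
        if( isTaken == null ) throw new ArgumentNullException(nameof(isTaken));

        String candidate = baseName;
        for( Int32 suffix = 2; isTaken( candidate ); suffix++ ) {
            candidate = baseName + "-" + suffix;
        }
        return candidate;
    }

    public static void Main() {
        // In-memory stand-in for the UrlTitle values already stored
        HashSet<String> existing = new HashSet<String> { "kaajh-kaalbh", "kaajh-kaalbh-2" };

        Console.WriteLine( MakeUnique( "kaajh-kaalbh", existing.Contains ) ); // prints kaajh-kaalbh-3
    }
}
```

Two concurrent requests can still pick the same candidate, so the UNIQUE constraint remains the final guard and the INSERT should be prepared to retry.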

And then it's simple to match the article corresponding to a given URL:

In ASP.NET MVC:

[Route("~/articles/{urlTitle}")]
public ActionResult GetArticle(String urlTitle) {

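    Article article;
    using( DbContext db = ... ) {
        article = db.Articles.SingleOrDefault( a => a.UrlTitle == urlTitle );
    }

    // SingleOrDefault returns null when no row matches, so respond with a 404
    if( article == null ) return this.HttpNotFound();

    return this.View( new ArticleViewModel( article ) );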
}

In my own code, I generate URL-friendly titles by first converting text to a normalized Unicode representation, then stripping out diacritics, and also dropping non-digit/letter characters.

Note that this only really works for Latin script. I've never had to target a non-Latin system (e.g. Greek, Cyrillic, Arabic, Hebrew, Farsi etc) so YMMV, but the same principles apply:

// Requires: using System; using System.Globalization; using System.Text;
public static String ConvertToUrlName(String title) {

    if( title == null ) throw new ArgumentNullException(nameof(title));

    // Convert to normalized Unicode
    // see here: https://stackoverflow.com/a/249126/159145
    title = title.Normalize( NormalizationForm.FormD );

    StringBuilder sb = new StringBuilder( title.Length );

    foreach(Char ch in title) {

        // If the character is a diacritic (a combining mark produced by the FormD decomposition), then ignore it
        if( CharUnicodeInfo.GetUnicodeCategory( ch ) == UnicodeCategory.NonSpacingMark ) continue;

        // The foreach iteration variable is read-only, so lower-case into a new local
        Char c = Char.ToLowerInvariant( ch );

        if( Char.IsLetterOrDigit( c ) ) {
            sb.Append( c );
        }
        else if( Char.IsWhiteSpace( c ) ) {
            sb.Append( '-' );
        }
        // and ignore all other character classes, such as punctuation
    }

    String urlTitle = sb.ToString();
    return urlTitle;
}

Ta-da.
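One caveat for the string in the question: a converter that maps each space to a hyphen and drops punctuation turns `My First - Recipe's` into `my-first--recipes`, because the two spaces around the dropped hyphen each become a hyphen. If that matters, a small post-processing step can tidy it up. `CollapseHyphens` below is a hypothetical helper, sketched on the assumption that you want single hyphens only and none at either end.

```csharp
using System;
using System.Text;

public static class UrlNameCleanup {

    // Hypothetical helper: collapses runs of hyphens into one and
    // removes hyphens from the start and end of the URL-name.
    public static String CollapseHyphens(String urlName) {
        if( urlName == null ) throw new ArgumentNullException(nameof(urlName));

        StringBuilder sb = new StringBuilder( urlName.Length );

        foreach(Char c in urlName) {
            // Skip a hyphen that would start the output or follow another hyphen
            if( c == '-' && ( sb.Length == 0 || sb[sb.Length - 1] == '-' ) ) continue;
            sb.Append( c );
        }

        // At most one trailing hyphen can remain at this point; drop it
        if( sb.Length > 0 && sb[sb.Length - 1] == '-' ) sb.Length--;

        return sb.ToString();
    }

    public static void Main() {
        Console.WriteLine( CollapseHyphens( "my-first--recipes" ) ); // prints my-first-recipes
    }
}
```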
