While you can use Unicode in web-page URLs, in practice and for usability you're restricted in the characters you can use. For example, if you want people to be able to type in the full address manually then it's best to avoid all non-alphanumeric characters, especially punctuation, and if possible to also avoid visually similar characters (homoglyphs) such as L, l, 1, i, I, and so on.
So if you have a web-page for an article whose internal title is "Kaajh'Káalbh!" then you'll want the URL-name to be kaajh-kaalbh.
Note that the conversion from "Kaajh'Káalbh!" to "kaajh-kaalbh" involves the loss of information. It is a "one-way function", which is to say that given a particular output ("kaajh-kaalbh") it is not easy to determine the original input ("Kaajh'Káalbh!"). In this case that's because multiple inputs lead to the same output, so you cannot know what the original input was: it could have been "Kaajh'Káalbh!" or "Kaajh Kaalbh" or "kaajh Kaalbh?", and so on.
You could argue that you could still query the database and find which rows correspond to the input, and I imagine your query would look like this:
SELECT * FROM Articles WHERE GetUrlVersionOfTitle( Title ) = 'kaajh-kaalbh'
Where GetUrlVersionOfTitle is a function in your SQL that would perform the conversion like so:
GetUrlVersionOfTitle( x ) = x.ToLower().Replace( ' ', '-' ).Replace( '\'', '-' ).Replace( etc )...
...which means your query becomes non-sargable: the predicate wraps the column in a function, so the database cannot use an index on Title and instead has to run the function on every row in your table, every time. That gives terrible runtime query performance. It also doesn't solve the problem of ensuring that at most one row has a given URL-name (to guarantee that a given URL-name input matches only one row).
The solution then is to precompute the URL-name, store it in a separate column, and also have a UNIQUE constraint against it:
CREATE TABLE Articles (
    ArticleId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Title nvarchar(255) NOT NULL,
    UrlTitle varchar(255) NOT NULL UNIQUE,
    ...
);

INSERT INTO Articles( Title, UrlTitle ) VALUES ( @title, @urlTitle );
(where @urlTitle is a parameter whose value is the precomputed URL-friendly version of Title).
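The UNIQUE constraint will reject a second article whose title converts to the same URL-name, so the application has to decide what to do in that case. One possible approach, sketched below, is to append a numeric suffix until a free name is found. This is only an illustration: UrlTitleSketch, MakeUnique and the isTaken callback are hypothetical names, and isTaken is assumed to be something like a lookup against the UrlTitle column.

```csharp
using System;

public static class UrlTitleSketch {
    // isTaken is a hypothetical callback, e.g. a lookup against Articles.UrlTitle.
    public static string MakeUnique(string baseUrlTitle, Func<string, bool> isTaken) {
        if (baseUrlTitle == null) throw new ArgumentNullException(nameof(baseUrlTitle));
        if (isTaken == null) throw new ArgumentNullException(nameof(isTaken));

        // Use the plain name when nothing else has claimed it
        if (!isTaken(baseUrlTitle)) return baseUrlTitle;

        // Otherwise try name-2, name-3, ... until a free name is found
        for (int suffix = 2; ; suffix++) {
            string candidate = baseUrlTitle + "-" + suffix;
            if (!isTaken(candidate)) return candidate;
        }
    }
}
```

Two requests can still race between the check and the INSERT, so the UNIQUE constraint remains the real guarantee; a sketch like this only reduces how often the constraint is hit.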
And then it's simple to match the article corresponding to a given URL:
In ASP.NET MVC:
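[Route("~/articles/{urlTitle}")]
public ActionResult GetArticle(String urlTitle) {
    Article article;
    using( DbContext db = ... ) {
        // The UNIQUE constraint on UrlTitle means at most one row can match
        article = db.Articles.SingleOrDefault( a => a.UrlTitle == urlTitle );
    }
    // SingleOrDefault returns null when no row matches, so respond with a 404
    if( article == null ) return this.HttpNotFound();
    return this.View( new ArticleViewModel( article ) );
}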
In my own code, I generate URL-friendly titles by first converting text to a normalized Unicode representation, then stripping out diacritics, replacing whitespace with hyphens, and dropping all other characters that are not letters or digits.
Note that this only really works for Latin script. I've never had to target a non-Latin system (e.g. Greek, Cyrillic, Arabic, Hebrew, Farsi etc.) so YMMV, but the same principles apply. The code looks like so:
// Requires: using System; using System.Globalization; using System.Text;
public static String ConvertToUrlName(String title) {
    if( title == null ) throw new ArgumentNullException(nameof(title));

    // Convert to normalized Unicode (FormD splits "á" into "a" plus a combining accent)
    // see here: https://stackoverflow.com/a/249126/159145
    title = title.Normalize( NormalizationForm.FormD );

    StringBuilder sb = new StringBuilder( title.Length );
    foreach(Char c in title) {
        // If the character is a diacritic or other non-base character, then ignore it
        if( CharUnicodeInfo.GetUnicodeCategory( c ) == UnicodeCategory.NonSpacingMark ) continue;

        // A foreach iteration variable is read-only, so lower-case into a local
        Char lower = Char.ToLowerInvariant( c );
        if( Char.IsLetterOrDigit( lower ) ) {
            sb.Append( lower );
        }
        else if( Char.IsWhiteSpace( lower ) ) {
            sb.Append( '-' );
        }
        // and ignore all other character classes, such as punctuation
    }

    String urlTitle = sb.ToString();
    return urlTitle;
}
Ta-da.
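One caveat: ConvertToUrlName ignores punctuation rather than replacing it, so "Kaajh'Káalbh!" comes out as kaajhkaalbh instead of the kaajh-kaalbh used in the example earlier, and repeated spaces produce repeated hyphens. If you want the hyphenated form, a possible variant is to treat every run of non-letter/digit characters as a single separator. The following is a minimal sketch of that idea, not the code I actually use; UrlNameSketch and ConvertToUrlNameWithSeparators are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class UrlNameSketch {
    // Hypothetical variant: each run of whitespace/punctuation becomes one hyphen.
    public static string ConvertToUrlNameWithSeparators(string title) {
        if (title == null) throw new ArgumentNullException(nameof(title));

        // FormD decomposes accented letters into a base letter plus combining marks
        string decomposed = title.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        bool pendingHyphen = false; // a separator was seen since the last letter/digit

        foreach (char c in decomposed) {
            // Skip the combining marks (the diacritics) left by the decomposition
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark) continue;

            if (char.IsLetterOrDigit(c)) {
                // Emit one hyphen between words, but never at the very start
                if (pendingHyphen && sb.Length > 0) sb.Append('-');
                pendingHyphen = false;
                sb.Append(char.ToLowerInvariant(c));
            }
            else {
                pendingHyphen = true; // whitespace, punctuation, symbols, etc.
            }
        }

        // A trailing separator is never emitted because hyphens are only written
        // immediately before the next letter or digit
        return sb.ToString();
    }
}
```

With this variant, "Kaajh'Káalbh!", "Kaajh Kaalbh" and "kaajh Kaalbh?" all map to kaajh-kaalbh, which also illustrates the many-inputs-to-one-output point made above. Like the original, it is only meant for Latin script.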