6

In my blog, I store URIs on entities to allow them to be customised (and friendly). Originally, they could contain spaces (e.g. "/tags/ASP.NET MVC"), but the W3C validator says spaces are not valid.

The System.Uri class accepts spaces and seems to encode them as I want (e.g. /tags/ASP.NET MVC becomes /tags/ASP.NET%20MVC), but I don't want to create a Uri just to throw it away; this feels dirty!

Note: None of Html.Encode, Html.AttributeEncode and Url.Encode will encode "/tags/ASP.NET MVC" to "/tags/ASP.NET%20MVC".


Edit: I edited the DataType part out of my question, as it turns out DataType does not directly provide any validation and there's no built-in URI validation. I found some extra validators at dataannotationsextensions.org, but they only support absolute URIs and it looks like spaces may be valid there too.

Danny Tuppeny
  • Your regular expression doesn't know about URL encoding, so it accepts any characters (but not no characters) after /tags/. There's a difference between URL encoded URLs (which is what the browser understands, and is sent over the network, for example) and your own set of paths. In this case, I'd store some "internal path" like `/tags/Tup Peny` and make sure it's encoded for the context when emitting it (in your case; URL encoding it for URL use). Does that make sense? :) – bzlm May 15 '11 at 11:25
  • It does, but if it's "valid" for me to store "/tags/Tup Peny", how do I encode it when I output it in an anchor's href attribute in a way that validates the W3C validator? – Danny Tuppeny May 15 '11 at 11:27
  • What about UrlPathEncode: http://msdn.microsoft.com/en-us/library/system.web.httpserverutility.urlpathencode.aspx – John Hoven May 15 '11 at 11:37
  • That won't work either :( check out the example in the docs, it'll encode the / – Danny Tuppeny May 15 '11 at 11:39
  • Do you have to store /tags/ then? That part seems like it could be added at runtime anyway. – John Hoven May 15 '11 at 11:55
  • Tags is just one example, other entities will have slashes (eg. "/2011/03/my blog post"), so I'd like a generic solution. I don't want to add "/tags/" in my views, so the idea is that all entities will have the uris (urls?) available as a property. – Danny Tuppeny May 15 '11 at 12:00
  • @Danny Regarding your note in the question; "Html.Encode, Html.AttributeEncode and Url.Encode" are not for encoding the path part of an URL. That's why they don't do what you want. :) – bzlm May 15 '11 at 15:38
  • I know, I tried them out of desperation, and included it here to try to avoid people posting answers saying to try them :D – Danny Tuppeny May 15 '11 at 19:04

4 Answers

2

It seems that the only sensible thing to do is not allow spaces in URLs. Support for encoding them correctly seems flaky in .NET :(

I'm going to instead replace spaces with a dash when I auto-generate them, and validate they only contain certain characters (alphanumeric, dots, dashes, slashes).
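A minimal sketch of that plan, not tested against the real blog. The class name `UrlRules` and both method names are hypothetical, and the allow-list is just the set described above (alphanumerics, dots, dashes, slashes):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative, not part of any framework.
public static class UrlRules
{
    // Allow-list: ASCII letters, digits, dots, slashes and dashes only.
    // \z anchors at the very end so a trailing newline is rejected too.
    private static readonly Regex Allowed = new Regex(@"^[A-Za-z0-9./-]+\z");

    // True when the stored URL contains only allow-listed characters.
    public static bool IsValid(string url)
    {
        return !string.IsNullOrEmpty(url) && Allowed.IsMatch(url);
    }

    // Auto-generates a URL from free text by turning each run of
    // whitespace into a single dash. Other characters are left alone,
    // so the result should still be passed through IsValid.
    public static string FromText(string text)
    {
        return Regex.Replace(text.Trim(), @"\s+", "-");
    }
}
```

With this, `UrlRules.FromText("/tags/ASP.NET MVC")` gives `/tags/ASP.NET-MVC`, which passes `IsValid`, while the original string with the space does not.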

I think the best way to use them would be to store %20 in the DB, as the space is "unsafe" and it seems non-trivial to then encode them in a way that will pass the W3C validator in .NET.

Danny Tuppeny
  • No, that's not true. Support for encoding spaces has been around everywhere on the web for many years now. The trick is to encode them using the correct encoding at the right time (as with any encoding). I agree that your dash-instead-of-space solution is fine, but please don't think support for %20 is broken around the web. :) – bzlm May 15 '11 at 15:26
  • Ok, maybe it's not that widespread, but .NET is a huge web platform and (seemingly) has no sensible way to fix "unsafe" characters in URLs :( – Danny Tuppeny May 15 '11 at 15:30
  • @Danny [Yes it does.](http://msdn.microsoft.com/en-us/library/system.web.httpserverutility.urlpathencode.aspx) Please review all the answers and their comments again. :) – bzlm May 15 '11 at 19:08
  • I did. That method encodes slashes in the path *and* fails to encode spaces. In the example, "http://www.contoso.com/articles.aspx?title = ASP.NET Examples" becomes "http%3a%2f%2fwww.contoso.com%2farticles.aspx?title = ASP.NET Examples". This is the exact opposite of what I want ;) – Danny Tuppeny May 15 '11 at 19:19
  • @Danny If you want `http://www.contoso.com/etc` to appear as the path part of an URL, then that's the way you have to encode it. I think you're confusing HTML, URIs, URLs, paths and queries here. I assume you're using the ASP.NET MVC 3 helpers, so you should use `Url.Action` anyway. But as long as you encode the path part (`/tag/TUP PENY`) using the path part encoder, and the query part (if any) using the query part encoder, and leave anything else (like the protocol, host name, port number etc) untouched, you will be fine. Encoding makes any character safe. That's the whole point. :) – bzlm May 15 '11 at 19:29
  • @Danny Regarding the "encoding of forward slashes"; that's purely cosmetic. TUP%2fPENY and TUP/PENY in the path part of an URL is the same thing. But the reason the slash is encoded is that you're telling the encoder to encode it. Slashes in ASP.NET MVC are usually route part delimiters, not just some characters that should be run through a generic URL path part encoder. – bzlm May 15 '11 at 19:36
  • @bzlm Using `Url.Action` doesn't help, as the full URL is stored on my entity. I'm simply trying to output a link to `"/tags/ASP.NET MVC"`, and the value I have in the database is `"/tags/ASP.NET MVC"`. Since it's invalid to write `` I need to find a way to encode the URL to ``. This is what the whole question was, but there doesn't seem to be a simple solution. – Danny Tuppeny May 15 '11 at 19:38
  • @bzlm The encoding of slashes is not cosmetic. If I output `` then the browser renders a link to `"/%2ftags"`! – Danny Tuppeny May 15 '11 at 19:39
  • @Danny The solution is to treat URLs the way they should be treated in a web MVC framework: not as arbitrary strings "stored on entities", but actual RESTful routes to resources. That way, `Url.Action` can construct the URL for you. (The second simplest solution is of course `UrlPathEncode` + manually decoding the slashes.) – bzlm May 15 '11 at 19:41
  • @bzlm That won't work. I'm storing blog posts and pages, and I deliberately want them to be able to have arbitrary URLs. Manually decoding slashes is a hack, which is why I said there's (seemingly) no nice way to do what I wanted. – Danny Tuppeny May 15 '11 at 19:45
  • @Danny Actually, it's using arbitrary URLs for non-static resources in ASP.NET MVC that's a hack. :) The reason the space needs to be encoded in `/tags/TUP PENY` isn't that the space is "unsafe" per se, but that the space actually does need encoding to `%20` for use in the path part of an URL. And when you tell the URL path encoder to encode the whole path it does exactly that, including the forward slashes. I'm not sure there's a "nice" way to accomplish what you want - surely there's an URL encoder somewhere in .NET which encodes spaces but not slashes, but using that would be a hack too. – bzlm May 15 '11 at 19:52
  • @bzlm My project uses MVC routing in the intended way for lots of stuff, but posts/pages are required to be completely flexible. That's a requirement of my app (and for compatibility with the previous one). Using Url.Action doesn't solve this problem anyway, so the point is somewhat moot. If any URL parser that encodes spaces and not slashes is a hack, then a) storing spaces is probably invalid and b) JavaScript is a hack. – Danny Tuppeny May 15 '11 at 20:14
0

I haven't used it, but UrlPathEncode sounds like it may give what you want.

You can encode a URL using the UrlEncode() method or the UrlPathEncode() method. However, the methods return different results. The UrlEncode() method converts each space character to a plus character (+). The UrlPathEncode() method converts each space character into the string "%20", which represents a space in hexadecimal notation.

EDIT: The JavaScript function encodeURI will use %20 instead of +. Add a reference to Microsoft.JScript and call GlobalObject.encodeURI. I tried that method and it gives the result you're looking for.

John Hoven
  • It also encodes slashes, which will not convert "/tag/ASP.NET MVC" to "/tag/ASP.NET%20MVC" :( – Danny Tuppeny May 15 '11 at 11:51
  • That seems to work, but doesn't it feel like a massive hack? Surely there's a much simpler way? What I'm trying to do doesn't feel unusual or uncommon :( – Danny Tuppeny May 15 '11 at 12:03
  • In a way it does. But looking at others who have asked this question already, it looks like there's not a really nice solution to this... either the 2 step solution Programming Hero says or this one. http://stackoverflow.com/questions/3375789/plus-in-mvc-argument-causes-404-on-iis-7-0 – John Hoven May 15 '11 at 12:07
  • Usually if something is difficult or hacky, you're doing it wrong. I'm wondering if spaces are just completely invalid and I should be storing "/tags/ASP.NET%20" instead :/ – Danny Tuppeny May 15 '11 at 12:14
  • That's a possibility. I assume your input is really "ASP.NET MVC" and then you concat that with /tags/ before you put it in the database. In that case you could call UrlPathEncode on the input before combining the two. – John Hoven May 15 '11 at 12:18
  • Note that it looks like Stack Overflow uses - instead of space, so it probably wasn't a problem they even wanted to touch ;) – John Hoven May 15 '11 at 12:19
  • On tags, it is. But in some cases I want to type custom URLs when creating an entity, e.g. "/good stuff/my first article", so it falls down. I guess the real question is whether "/good stuff/my first article" is actually a valid URL at all. – Danny Tuppeny May 15 '11 at 12:20
  • Yeah, I actually use dashes instead of spaces, but I wanted to understand the issue rather than just dodging it this time :D I'm getting close to thinking validation (and a "SafeUrl" method) to only allow alphanumerics and dashes might be easiest! – Danny Tuppeny May 15 '11 at 12:22
  • It's considered unsafe. You might consider replacing spaces with a dash or underscore; I know I see that in a lot of blog formats. http://stackoverflow.com/questions/497908/are-urls-allowed-to-have-a-space-in-them – John Hoven May 15 '11 at 12:22
  • @ avoiding dodging the issue - I understand. I guess not-dodging feels hacky though.... since everyone else dodges it ;) – John Hoven May 15 '11 at 12:23
  • Yep, you're right. I'm going to just create some validation and specify characters that are allowed and use it for all entities. Shame that what sounds like a trivial problem is actually so complicated to solve :/ – Danny Tuppeny May 15 '11 at 12:29
  • My vote goes for dodging the issue. URLs with Hex-encoded values are pretty hard for humans to eyeball. I'd rather see `/blog/my-first-post/` over `/blog/my%20first%20post` (blarg). – Paul Turner May 15 '11 at 12:30
  • %20 is for the path part, and + is for the query part. "Within the query string, the plus sign is reserved as shorthand notation for a space" according to the RFC. So `/tags/Tup%20Penny?Tup+Peny` is the right way to encode the raw path `/tags/Tup Penny` with the query parameter `Tup Peny`. This is why `UrlEncode` and `UrlPathEncode` differ. Nothing to do with ASP.NET MVC. :) And AFAIK, this was only ever a problem when people had malformed URLs in their HTML, eg. actual un-encoded spaces and whatnot. Right? – bzlm May 15 '11 at 15:35
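To make the path/query distinction in the last comment concrete, here is a small sketch. It swaps in `Uri.EscapeDataString` and `WebUtility.UrlEncode` instead of the `HttpUtility` methods discussed above, assuming a runtime where `System.Net.WebUtility.UrlEncode` is available (.NET 4.5 or later); the class and method names are made up for illustration:

```csharp
using System;
using System.Net;

// Hypothetical demo class: the names are illustrative only.
public static class SpaceEncodingDemo
{
    // Percent-encodes a single path segment; a space becomes %20.
    public static string ForPathSegment(string segment)
    {
        return Uri.EscapeDataString(segment);
    }

    // Form-style encoding for a query value; a space becomes +.
    public static string ForQueryValue(string value)
    {
        return WebUtility.UrlEncode(value);
    }
}
```

`ForPathSegment("Tup Peny")` yields `Tup%20Peny`, while `ForQueryValue("Tup Peny")` yields `Tup+Peny`, which is the difference the comment describes.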
0

URIs and URLs are two different things, URLs being a subset of URIs. As such, a URL has different restrictions to a URI.

To encode a path string to proper W3C URL encoding standards, use HttpUtility.UrlPathEncode(string). It'll add the encoded spaces you're after.

You should store your URLs in whatever form is most useful for you to work with. It can be useful to refer to them as URIs until the point at which you encode them into a URL-compliant format, but that's just semantics to help your design be a little clearer.

EDIT:

If you don't like the slashes being encoded, it's pretty simple to "decode" them by replacing the encoded %2f with the simpler /:

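using System.Web; // HttpUtility lives in the System.Web namespace

var path = "/tags/ASP.NET MVC";
// Encode the path, then turn any slashes encoded as %2f back into /
var url = HttpUtility.UrlPathEncode(path).Replace("%2f", "/");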
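If the replace step feels too hacky, another option is to never hand the slashes to the encoder in the first place: split the path on `/`, escape each segment, and join it back up. This is only a sketch of that alternative, using `Uri.EscapeDataString` rather than `UrlPathEncode`; the `PathEncoder` class and `EncodePath` method are hypothetical names:

```csharp
using System;
using System.Linq;

// Hypothetical helper: the names are illustrative only.
public static class PathEncoder
{
    // Escapes each segment of a slash-separated path on its own,
    // so the slashes themselves are never passed to the encoder.
    public static string EncodePath(string path)
    {
        var segments = path
            .Split('/')
            .Select(segment => Uri.EscapeDataString(segment));

        // Re-join with literal slashes; empty segments (such as the
        // one before a leading slash) stay empty.
        return string.Join("/", segments);
    }
}
```

Calling `PathEncoder.EncodePath("/tags/ASP.NET MVC")` returns `/tags/ASP.NET%20MVC`. Note that this treats every `/` in the stored value as a separator, so it assumes a segment never needs to contain a literal slash.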
Paul Turner
  • UrlEncode will convert slashes, which messes up the links. The System.Uri class seems to correctly take a path and encode the spaces (without slashes), however EF seems to barf on a Uri without special handling :( – Danny Tuppeny May 15 '11 at 11:51
  • I'm trying to read about URI vs URL to make sure I'm referring to things correctly, but the more I read, the more I'm confused! ;( – Danny Tuppeny May 15 '11 at 12:00
  • Uniform Resource *Identifier* vs Uniform Resource *Locator*. Check out Wikipedia for some clear advice: http://en.wikipedia.org/wiki/Uniform_Resource_Identifier – Paul Turner May 15 '11 at 12:21
  • The lack of examples was what confused me, but now I've changed my properties to be called "Url", as it seems that's more appropriate. – Danny Tuppeny May 15 '11 at 12:31
0

I asked a similar question a while ago. The short answer was to replace spaces with "-" and then convert them back again when reading. This is the source I used:

private static string EncodeTitleInternal(string title)
{
        if (string.IsNullOrEmpty(title))
                return title;

        // Search engine friendly slug routine with help from http://www.intrepidstudios.com/blog/2009/2/10/function-to-generate-a-url-friendly-string.aspx

        // remove invalid characters
        title = Regex.Replace(title, @"[^\w\d\s-]", "");  // this is unicode safe, but may need to revert back to 'a-zA-Z0-9', need to check spec

        // convert multiple spaces/hyphens into one space       
        title = Regex.Replace(title, @"[\s-]+", " ").Trim(); 

        // If it's over 75 chars, take the first 75.
        title = title.Substring(0, title.Length <= 75 ? title.Length : 75).Trim(); 

        // hyphenate spaces
        title = Regex.Replace(title, @"\s", "-");

        return title;
}
Chris S
  • Though it doesn't really answer the question about using spaces/encoding, this is pretty much what I'm going to do now :-) – Danny Tuppeny May 16 '11 at 12:10
  • @Danny My answer is definitely: don't bother URL encoding, but replace spaces with dashes (and remove other bad characters) to make the URLs SEO friendly, just like Stack Overflow does – Chris S May 16 '11 at 12:17