I need to validate user input for an href on the server side and make sure that only http:// and https:// are allowed as the protocol (if one is specified at all). The objective is to rule out potentially malicious values such as javascript:... or anything similar.

What makes it difficult is the number of ways the colon can be encoded in such a string, e.g. `&#58`, `&#x0003A`, and other character references that browsers render as `:`. I'd like to transform the value and see it the way browsers do before they render the page.

One option would be to build a DOM document using AngleSharp, as it does a perfect job of parsing attributes. I could then retrieve the value and validate it, but building the whole DOM tree just to parse one value seems like overkill. Is there a way to use AngleSharp to parse just an attribute value? Or is there a library I could use just for this task?

I also found this question, but the method used in there does not really parse the URIs the way browsers do.

Joel Coehoorn
jakubiszon
  • `//` is also a way browsers can navigate - `//google.com` – TryingToImprove Oct 08 '18 at 18:22
  • Whatever URL validation method you use, be sure to encode the URL before outputting it for the user. You mentioned `javascript:`, but also keep in mind there are things like `https://example.com?whatever="/>`. If `"` (and other HTML characters) aren't escaped there, an attacker could break out of the attribute. There are lots of tricky things attackers can do, but the most important thing in this scenario is escaping the output. Also worth mentioning: even if you handle this perfectly, they can still link to a malicious site. – Gray Oct 08 '18 at 19:03
  • At the moment I do not see any simple way to interpret the attributes as the browsers see them, so I decided to test links as I described here: https://stackoverflow.com/a/52722132/523898. Still, I think a function that shows what the browsers actually see as an attribute value would be pretty useful. – jakubiszon Oct 09 '18 at 13:23
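
To illustrate Gray's point about escaping the output, here is a minimal Python sketch (Python is used only as a neutral illustration language, and `render_link` is a hypothetical helper, not part of any library). It assumes the href has already been validated separately and only shows that HTML-escaping the value keeps a `"` from closing the attribute early.

```python
import html


def render_link(href: str, text: str) -> str:
    """Build an anchor tag with the attribute value and link text HTML-escaped.

    Hypothetical helper for illustration; it does not validate the scheme.
    """
    # quote=True also escapes " and ', so the value cannot break out of href="..."
    safe_href = html.escape(href, quote=True)
    safe_text = html.escape(text)
    return '<a href="{}">{}</a>'.format(safe_href, safe_text)


if __name__ == "__main__":
    print(render_link('https://example.com?whatever="/>', "example"))
```

Escaping happens at output time, so it complements scheme validation rather than replacing it.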

1 Answer

You want the HtmlDecode() method. You may need to add a reference to the project to use it.
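
The decode-then-check idea can be sketched as follows. This is a rough Python illustration under stated assumptions, not the C# API mentioned above: `effective_href` and `is_safe_href` are hypothetical names, `html.unescape` stands in for the HTML decoding step, and the whitespace handling only approximates what browsers do (dropping leading and trailing control characters and spaces, and removing tabs and newlines anywhere in the value). Other browser quirks, such as backslash handling, are not covered.

```python
import html
import re

ALLOWED_SCHEMES = {"http", "https"}

# RFC 3986 scheme syntax: a letter, then letters, digits, "+", "." or "-", then ":"
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

# C0 control characters (0x00-0x1F) plus space
_C0_AND_SPACE = "".join(chr(c) for c in range(0x21))


def effective_href(raw: str) -> str:
    """Approximate the attribute value a browser would pass to its URL parser."""
    decoded = html.unescape(raw)             # resolves &#58, &#x0003A;, &colon;, ...
    decoded = decoded.strip(_C0_AND_SPACE)   # drop leading/trailing controls and spaces
    return re.sub(r"[\t\n\r]", "", decoded)  # tabs and newlines are removed anywhere


def is_safe_href(raw: str) -> bool:
    """Accept http(s) URLs and values with no scheme at all; reject everything else."""
    match = _SCHEME_RE.match(effective_href(raw))
    if match is None:
        # No scheme: a relative URL, or a protocol-relative one such as //google.com
        return True
    return match.group(1).lower() in ALLOWED_SCHEMES


if __name__ == "__main__":
    for value in ["https://example.com", "javascript/some-file", "javascript&#58alert(30)"]:
        print(value, is_safe_href(value))
```

Note that a value with no scheme passes this check, which includes protocol-relative URLs like `//google.com`; whether those should be accepted is a separate policy decision.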

Joel Coehoorn
  • Thank you, but I am afraid it does not work as expected: `javascript:alert(30)` is not interpreted the way browsers interpret it. – jakubiszon Oct 08 '18 at 19:42
  • @Jakub Why are you required to support special encoding like this? Normal users would be very unlikely to use html entities or other alternative forms. Wouldn't something like a client-side check for http(s) at the start, followed by using the link you mentioned to verify the `scheme` is actually http/https by parsing the url be sufficient for your check? – Gray Oct 08 '18 at 20:41
  • Because a hacker could do this. I need to make sure the URLs I store are safe. – jakubiszon Oct 08 '18 at 21:07
  • @Jakub Why does simply disallowing/rejecting malformed URLs not solve that problem? Like I said - force the users to only match "http(s)://", then parse that URL with the C# library from the linked page to ensure the scheme is ACTUALLY http(s). As long as you encode the output, you should be fine. What requirement do you have that that does not cover? Just trying to understand. – Gray Oct 08 '18 at 22:03
  • @Gray I also need to support URLs relative to the same server. These could as well be `javascript/some-file`. Perhaps I could force them to include the protocol too e.g. `http://mysite/javascript/some-file`. – jakubiszon Oct 08 '18 at 22:20
  • @Jakub Anything you can do to limit the input will make this a lot easier/safer. I don't think it is a big trade-off to require the users to specify http(s)... then again, I am a penetration tester, not a UX person. – Gray Oct 08 '18 at 22:42
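
The stricter approach discussed in these comments (require an explicit `http://` or `https://` prefix, then parse the value and confirm the scheme) might look like the sketch below. It is written in Python purely for illustration, `is_absolute_http_url` is a hypothetical name, and it assumes relative URLs are simply rejected, as suggested in the last two comments.

```python
from urllib.parse import urlsplit


def is_absolute_http_url(value: str) -> bool:
    """Accept only values that start with http:// or https:// and name a host."""
    # Step 1: literal prefix check, so encoded or whitespace-padded values fail early
    if not value.lower().startswith(("http://", "https://")):
        return False
    # Step 2: parse and confirm what the parser actually sees
    try:
        parts = urlsplit(value)
    except ValueError:
        # urlsplit raises ValueError for some malformed hosts
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


if __name__ == "__main__":
    print(is_absolute_http_url("http://mysite/javascript/some-file"))
    print(is_absolute_http_url("javascript/some-file"))
```

With this policy a relative path such as `javascript/some-file` has to be submitted as `http://mysite/javascript/some-file`, which trades some convenience for a much smaller input space.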