
As with any user-supplied data, the URLs will need to be escaped and filtered appropriately to avoid all sorts of exploits. I want to be able to:

  • Put user-supplied URLs in href attributes. (Bonus points if I don't get screwed if I forget to write the quotes.)

    <a href="ESCAPED_USER_URL_GOES_HERE">...</a>
    
  • Forbid malicious URLs such as javascript: stuff or links to evil domain names.

  • Allow some leeway for the users. I don't want to raise an error just because they forgot to add an http:// or something like that.

Unfortunately, I can't find any "canonical" solution to this sort of problem. The only thing I could find as inspiration is the encodeURI function from JavaScript, but that doesn't help with my second point, since it just does a simple URL parameter encoding while leaving special characters such as : and / alone.

hugomg
  • what is an "evil domain name"? how would any code logic be able to tell the difference between an evil and a good one? and also... you're saying you want to put user supplied URLs, but forbid "links". What does that even mean? – eis Feb 11 '13 at 07:05
  • @eis: I'm being vague on purpose. For example, "evil" could be something from a blacklist I have. The important bit is that I want to be able to analyse the URLs (just using encodeURI wouldn't do that, for example). As for the "links" part, that was just a typo. – hugomg Feb 11 '13 at 13:36

2 Answers

3

OWASP provides a list of regular expressions for validating user input, one of which is used for validating URLs. This is as close as you're going to get to a language-neutral, canonical solution.

More likely, you'll rely on the URL-parsing library of the programming language in use, or on a URL-parsing regex.

The workflow would be something like the following (a rough sketch in Python follows the list):

  1. Verify the supplied string is a well-formed URL.
  2. Provide a default protocol such as http: when no protocol is specified.
  3. Maintain a whitelist of acceptable protocols (http:, https:, ftp:, mailto:, etc.)
    1. The whitelist will be application-specific. For an address-book app the mailto: protocol would be indispensable. It's hard to imagine a use case for the javascript: and data: protocols.
  4. Enforce a maximum URL length. This keeps URLs within the length limits that browsers impose and prevents attackers from polluting the page with megabyte-length strings. With any luck your URL-parsing library will do this for you.
  5. Encode the URL string for the usage context (escaped for HTML output, escaped for use in an SQL query, etc.).
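
Here is a rough sketch of that workflow in Python, using only the standard library. The scheme whitelist, the 2048-character cap and the sanitize_url name are illustrative assumptions, not part of any canonical solution:

    from urllib.parse import urlsplit, urlunsplit
    import html

    ALLOWED_SCHEMES = {"http", "https", "ftp", "mailto"}  # application-specific whitelist
    MAX_URL_LENGTH = 2048                                  # arbitrary browser-friendly cap

    def sanitize_url(raw):
        """Return an HTML-escaped URL string, or None if the input is rejected."""
        url = raw.strip()
        if len(url) > MAX_URL_LENGTH:
            return None                            # step 4: refuse absurdly long strings
        try:
            parts = urlsplit(url)                  # step 1: well-formedness via the parser
            if not parts.scheme:
                parts = urlsplit("http://" + url)  # step 2: default protocol
        except ValueError:
            return None                            # step 1: malformed URL
        scheme = parts.scheme.lower()
        if scheme not in ALLOWED_SCHEMES:
            return None                            # step 3: protocol whitelist
        if scheme != "mailto" and not parts.netloc:
            return None                            # no host at all -> not a usable link
        return html.escape(urlunsplit(parts))      # step 5: escape for HTML output

For example, sanitize_url("www.example.com/a?b=1&c=2") would come back as http://www.example.com/a?b=1&amp;c=2, while sanitize_url("javascript:alert(1)") would come back as None.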

Forbid malicious URLs such as javascript: stuff or links to evil domain names.

You can utilize the Google Safe Browsing API to check a domain for spyware, spam or other "evilness".
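
If you go that route, a hedged sketch against the current v4 Lookup API might look like the following. The endpoint, JSON payload shape and threat-type names are assumptions to double-check against Google's documentation, and the API key is a placeholder:

    import json
    import urllib.request

    API_KEY = "YOUR_API_KEY"  # placeholder; obtained from the Google API console

    def is_flagged_by_safe_browsing(url):
        """Ask the Safe Browsing Lookup API whether a URL appears on a threat list."""
        endpoint = "https://safebrowsing.googleapis.com/v4/threatMatches:find?key=" + API_KEY
        body = {
            "client": {"clientId": "my-app", "clientVersion": "1.0"},
            "threatInfo": {
                "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING", "UNWANTED_SOFTWARE"],
                "platformTypes": ["ANY_PLATFORM"],
                "threatEntryTypes": ["URL"],
                "threatEntries": [{"url": url}],
            },
        }
        request = urllib.request.Request(
            endpoint,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        return bool(result.get("matches"))  # an empty object means "no matches found"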

leepowers
  • note also that validating is always a compromise. For example, regex given in the OWASP site for urls accepts `http://google` as a valid url, but not `www.google.com` or `http://www.hän.fi/` (an example of an [IDNA](http://en.wikipedia.org/wiki/Internationalized_domain_name)). It also accepts `http://user:pass@domain.com`, which might not be something you want to allow. – eis Feb 11 '13 at 09:56
0

For the first point, regular attribute encoding works just fine: escape characters into HTML entities. Escaping quotes, the ampersand and the angle brackets is enough if attributes are guaranteed to be quoted; escaping every other non-alphanumeric character as well will keep the attribute safe even if it's accidentally unquoted.
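
A minimal sketch of that aggressive encoding (the encode_for_attribute name is made up for illustration): every character outside ASCII letters and digits becomes a numeric entity, so the value stays inert even in an unquoted attribute.

    def encode_for_attribute(value):
        """HTML-entity-encode everything except ASCII alphanumerics for an attribute value."""
        return "".join(
            ch if (ch.isascii() and ch.isalnum()) else "&#x{:X};".format(ord(ch))
            for ch in value
        )

With this, encode_for_attribute("x onmouseover=alert(1)") turns the spaces and the equals sign into entities, so the payload cannot break out of the attribute.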

The second point is vague and depends on what you want to do. Just remember to use a whitelist approach instead of a blacklist one: it's possible to use HTML entity encoding and other tricks to get around most simple blacklists.
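
To make the whitelist-over-blacklist point concrete, here is a hypothetical sketch: a naive substring blacklist misses an entity-encoded javascript: payload, while decoding first and whitelisting the parsed scheme rejects it.

    import html
    from urllib.parse import urlsplit

    ALLOWED_SCHEMES = {"http", "https", "ftp", "mailto"}  # assumed whitelist

    def blacklist_check(url):
        """Naive blacklist: accepts anything that doesn't literally contain 'javascript:'."""
        return "javascript:" not in url.lower()

    def whitelist_check(url):
        """Decode entities first, then only accept explicitly allowed (or empty) schemes."""
        decoded = html.unescape(url).strip()
        scheme = urlsplit(decoded).scheme.lower()
        return scheme == "" or scheme in ALLOWED_SCHEMES

    payload = "jAvAsCrIpT&#58;alert(1)"   # the browser decodes &#58; back to ':'
    print(blacklist_check(payload))       # True  -> the blacklist lets it through
    print(whitelist_check(payload))       # False -> the whitelist rejects it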

hugomg