
Which encoding should be applied to the contents of an href attribute: HTML or URL encoding?

<a href="???">link text</a>

On the one hand, since the href attribute contains a URL, I should use URL encoding. On the other hand, I'm inserting this URL into HTML, so it must be HTML encoded.

Please help me resolve this contradiction.

Thanks.


EDIT:

Here's the contradiction. Suppose there might be '<' and '>' characters in the URL. URL encoding won't escape them, so there will be reserved HTML characters inside the href attribute, which violates the standard. HTML encoding will escape the '<' and '>' characters and the HTML will be valid, but after that there will be unexpected '&' characters in the URL (this is a reserved character in URLs; it's used as the delimiter between query string parameters).

Reserved URL characters form a superset of the reserved HTML characters, except for '<' and '>', which are reserved in HTML but not in URLs.


EDIT 2:

I was wrong about the '<' and '>' characters: they are actually percent-escaped by URL encoding. If so, URL encoding is sufficient in this case, isn't it?

Maksim Tyutmanov
  • Have you tried anything so far? – Michael Sazonov Apr 17 '12 at 10:30
  • This "have you tried anything" meme is getting silly. What with browser error recovery, a large part of data encoding is to protect against security problems. How are you supposed to tell you got it right if you are trying something? Assume that whatever security testing suite you have has sufficient coverage? This is a perfectly reasonable question about a fundamental technique. – Quentin Apr 17 '12 at 10:39
  • Quentin is more or less right, but the question remains, what situations can be contradictory? Can you show an example? And did you try both solutions and did they both work, or both not work? – Mr Lister Apr 17 '12 at 10:42
  • Yeah, I've tried both ways and updated the question. It seems to me that HTML encoding isn't appropriate in this case at all. Now I'm trying to figure out whether that is really so. – Maksim Tyutmanov Apr 17 '12 at 12:23
  • Re your edit: I'm not sure what you mean by "HTML encoding will escape '<' and '>' characters and HTML will be valid, but after that there will be unexpected '&' characters in the URL". How so? `&lt;` is simply the way to write a `<` in your HTML source; it is translated back to `<` at a very low level, long before it gets sent out to the server. Same with `&`: you should write `&amp;` and the system will know you meant `&`. Or did you mean something else? – Mr Lister Apr 17 '12 at 20:45
  • Well, I tried to use only HTML encoding and it wouldn't work. It just put encoded HTML entities into the generated URL (the URL was http://localhost:59381/Home/Index?q=&lt;tag&gt;, the initial query string was q=<tag>). I'm not completely sure, but it seems to me that URL encoding will be sufficient in this case and there's no need to use HTML encoding at all. A solution with only URL encoding works fine with every kind of input I've tried so far. It seems I lack an understanding of the basic principles and will have to take a deeper look into this problem anyway. – Maksim Tyutmanov Apr 19 '12 at 10:30
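
To make the difference between the two encodings concrete, here is a minimal sketch that applies each one to the same value. It is a Python analogue rather than the ASP.NET or PHP functions mentioned in this thread, and the sample value `<tag> & more` is just an assumed input for illustration.

```python
from html import escape
from urllib.parse import quote

# Assumed sample input containing characters that matter to both formats.
value = "<tag> & more"

# URL (percent) encoding: makes the value safe to place inside a URL component.
url_encoded = quote(value, safe="")

# HTML encoding: makes the value safe to place inside HTML text or an attribute.
html_encoded = escape(value, quote=True)

print(url_encoded)   # %3Ctag%3E%20%26%20more
print(html_encoded)  # &lt;tag&gt; &amp; more
```

Percent encoding does escape '<' and '>', as EDIT 2 observes, but the two functions target different formats, so one is not a substitute for the other.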

1 Answer


Construct a URL as normal. Follow the rules for constructing URLs. Encode data you put into it.

Then construct HTML as normal. Follow the rules for constructing HTML. Encode data as you put it into it.

i.e. Do both (but in the right order).

They aren't mutually exclusive, so there is no contradiction.

For example (this is a simplified example that assumes data in $_GET is correct and exists, don't do that in the real world):

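$search_term = $_GET['q'];
$page = (int) $_GET['page'];
$next_page = $page + 1;
// Step 1: build the URL, URL-encoding each piece of data placed into it.
$next_page_url = 'http://example.com/search?q=' . urlencode($search_term) . '&page=' . urlencode((string) $next_page);
// Step 2: build the HTML, HTML-encoding the finished URL as it goes into the attribute.
$html = '<a href="' . htmlspecialchars($next_page_url) . '">link text</a>';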
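
For readers not using PHP, the same two-step order can be sketched in Python. This is only an analogue under stated assumptions: `next_page_link` is a hypothetical helper name, and `urllib.parse.urlencode` and `html.escape` stand in for `urlencode()` and `htmlspecialchars()`.

```python
from html import escape
from urllib.parse import urlencode


def next_page_link(search_term: str, page: int) -> str:
    """Build a 'next page' link: URL-encode first, HTML-encode second."""
    # Step 1: construct the URL; urlencode percent-encodes each value
    # and joins the parameters with a literal '&'.
    query = urlencode({"q": search_term, "page": page + 1})
    url = "http://example.com/search?" + query
    # Step 2: construct the HTML; escape turns that '&' into '&amp;'
    # and would also escape any quotes in the URL.
    return '<a href="' + escape(url, quote=True) + '">link text</a>'


link = next_page_link("<tag> & more", 1)
print(link)
# <a href="http://example.com/search?q=%3Ctag%3E+%26+more&amp;page=2">link text</a>
```

The '&' that came from the data is percent-encoded as `%26`, while the '&' that separates the parameters is written as `&amp;` in the HTML source, so the two never get confused.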
cl0ne
Quentin
  • Thanks, Quentin, I've got your point. But I'm not quite sure about two things. 1) What would happen if htmlspecialchars() actually encoded something? If so, there would be '&amp;' characters inside the URL, which is not allowed. 2) Is it possible for URL encoding to leave some reserved HTML characters after itself? I think it isn't. – Maksim Tyutmanov Apr 17 '12 at 12:34
  • There wouldn't be `&amp;` inside the URL. There would be `&amp;` inside the HTML. The HTML would be parsed and the character `&` would appear in the DOM. If you copy/pasted the HTML source of the attribute into a browser then it would break, but you shouldn't do that. It would also break if you stored the URL in a text file, gzipped it, then copy/pasted the binary content of the compressed file to the address bar. – Quentin Apr 17 '12 at 12:35
  • I don't recall the list of characters that are/aren't encoded in URLs off the top of my head. Certainly URLs can include characters (such as `&`) which do have special meaning in HTML (and which shouldn't be urlencoded if you want them to have their special meaning in the URL, as per the example I gave). – Quentin Apr 17 '12 at 12:36
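
The point in the comments above, that `&amp;` exists only in the HTML source and is decoded back to `&` when the HTML is parsed, can be checked with a small sketch. It uses Python's standard `html.parser` as a stand-in for a browser's parser, which is an assumption for illustration; `HrefCollector` is a hypothetical helper class and the sample markup is made up.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class HrefCollector(HTMLParser):
    """Collects the decoded href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


# HTML source as it would be written to the page: '&' appears as '&amp;'.
source = '<a href="http://example.com/search?q=%3Ctag%3E&amp;page=2">link text</a>'

collector = HrefCollector()
collector.feed(source)
collector.close()

# The parser hands back the attribute value with the entity already decoded.
href = collector.hrefs[0]
print(href)    # http://example.com/search?q=%3Ctag%3E&page=2

# URL decoding then recovers the original data.
params = parse_qs(urlsplit(href).query)
print(params)  # {'q': ['<tag>'], 'page': ['2']}
```

Decoding happens in the reverse order of encoding: the HTML layer is removed first, then the URL layer.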