
Users provide both properly escaped URLs and raw URLs to my website in a text input; for example I consider these two URLs equivalent:

https://www.cool.com/cool%20beans
https://www.cool.com/cool beans

Now I want to render these as <a> tags later, when viewing this data. I am stuck between encoding the given text and getting these links:

<a href="https://www.cool.com/cool%2520beans">   <!-- This one is broken! -->
<a href="https://www.cool.com/cool%20beans">

Or not encoding it and getting this:

<a href="https://www.cool.com/cool%20beans">
<a href="https://www.cool.com/cool beans">       <!-- This one is broken! -->

What's the best way out from a user experience standpoint with modern browsers? I'm torn between doing a decoding pass over their input, or the second option I listed above where we don't encode the href attribute.
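To make the failure concrete, here's a quick sketch in Python (not necessarily my actual stack, just illustrating the behavior) showing how encoding an already-encoded URL produces the broken %2520 form:

```python
from urllib.parse import quote

# Blindly encoding double-escapes input that was already encoded:
# the "%" of "%20" itself becomes "%25".
print(quote("https://www.cool.com/cool%20beans", safe=":/"))
# https://www.cool.com/cool%2520beans
print(quote("https://www.cool.com/cool beans", safe=":/"))
# https://www.cool.com/cool%20beans
```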

Cory Kendall
  • You could of course use a server side script to check whether the url posted is encoded or not, then encode it if needed. – jtheman Apr 18 '13 at 22:50

2 Answers


If you want to avoid double-encoding the links, you can run urldecode() on every URL first and then urlencode() the result. Decoding "https://www.cool.com/cool beans" returns it unchanged, while decoding "https://www.cool.com/cool%20beans" restores the space, so both forms collapse to the same string before you encode them exactly once.
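The decode-then-encode idea looks like this in Python (a sketch that only handles the simple path case from the question; in PHP, rawurldecode()/rawurlencode() are the closer analogue, since urlencode() encodes spaces as "+"):

```python
from urllib.parse import quote, unquote

def normalize(url: str) -> str:
    # Decoding a raw URL is a no-op, while decoding an encoded one
    # restores the space, so both inputs collapse to the same string
    # before being re-encoded exactly once.
    return quote(unquote(url), safe=":/")

print(normalize("https://www.cool.com/cool beans"))    # https://www.cool.com/cool%20beans
print(normalize("https://www.cool.com/cool%20beans"))  # https://www.cool.com/cool%20beans
```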

Alternatively, you could scan for already-encoded characters with the strpos() function — note the strict !== false comparison, which is needed because strpos() returns 0 (a falsy value) when the match is at the start of the string:

if (strpos($url, "%20") !== false) {
    // Encoded character found
}

Ideally you would scan for an array of common encoded characters rather than just "%20".
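Rather than listing escapes one by one, a regex can treat any %XX sequence as evidence the URL is already encoded (Python sketch; the function name is my own):

```python
import re

def looks_encoded(url: str) -> bool:
    # True if the URL contains any percent-escape: "%"
    # followed by two hex digits, e.g. "%20" or "%2F".
    return re.search(r"%[0-9A-Fa-f]{2}", url) is not None
```

Caveat: a raw URL that happens to contain a literal "%" followed by two hex digits would be misclassified, which is the same ambiguity raised in the comment on this answer.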

Chris Brown
  • Decoding is a destructive process. Re-encoding is not correct, and can lead to subtle bugs. See for example this valid encoded target `/foo%24bar$baz`. If you decode it, you'll get `/foo$bar$baz`, and if you try to encode that again, it's not clear what to do, since `$` and `%24` have different meanings. – alx - recommends codidact Apr 19 '23 at 12:49

You should not accept such requests, as they are invalid.

https://datatracker.ietf.org/doc/html/rfc9112#section-3.2-3

No whitespace is allowed in the request-target. Unfortunately, some user agents fail to properly encode or exclude whitespace found in hypertext references, resulting in those disallowed characters being sent as the request-target in a malformed request-line.

Recipients of an invalid request-line SHOULD respond with either a 400 (Bad Request) error or a 301 (Moved Permanently) redirect with the request-target properly encoded. A recipient SHOULD NOT attempt to autocorrect and then process the request without a redirect, since the invalid request-line might be deliberately crafted to bypass security filters along the request chain.

Tell your clients to send well-formed HTTP requests.

Trying to accept such requests will probably result in bugs. As others have suggested, you can pre-encode conditionally if you see invalid characters in the request target, or you can decode and then re-encode all requests. However, both approaches are problematic if a request contains a character that is valid both encoded and decoded (for example, $ and %24, which have different meanings).
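The ambiguity can be demonstrated directly (Python sketch; the path is the hypothetical example from the comment on the other answer):

```python
from urllib.parse import quote, unquote

original = "/foo%24bar$baz"   # "%24" and "$" are distinct here
decoded = unquote(original)   # "/foo$bar$baz" -- the distinction is already lost

# Re-encoding cannot recover it: either both characters stay "$"...
assert quote(decoded, safe="/$") == "/foo$bar$baz"
# ...or both become "%24". Neither result equals the original target.
assert quote(decoded, safe="/") == "/foo%24bar%24baz"
```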

The only thing that you can safely do is to reject such invalid requests.