I have a simple bare WSGI application:

def application(environ, start_response):
    start_response('200 OK', [('Content-Type','text/html')])
    print('PATH_INFO:', environ['PATH_INFO'])
    return [b'<p>Hello World</p>']

if __name__ == '__main__':
    from wsgiref import simple_server
    server = simple_server.make_server('0.0.0.0', 8080, application)
    server.serve_forever()

I make two requests:

C:\>curl "http://localhost:8080/<foo>"
<p>Hello World</p>
C:\>curl "http://localhost:8080/%3Cfoo%3E"
<p>Hello World</p>

I get this output:

C:\code>python foo.py
PATH_INFO: /<foo>
127.0.0.1 - - [09/Mar/2014 13:48:39] "GET /<foo> HTTP/1.1" 200 18
PATH_INFO: /<foo>
127.0.0.1 - - [09/Mar/2014 13:48:47] "GET /%3Cfoo%3E HTTP/1.1" 200 18

Notice that my application gets the URL-decoded path /<foo> even when the client requests /%3Cfoo%3E.

This suggests that wsgiref.simple_server ensures that my application always gets the URL-decoded path in environ['PATH_INFO'].

But I can't find this behavior documented anywhere in PEP 3333. Can you please point me to official documentation of this behavior?

Lone Learner
  • Why would the difference matter? – Ignacio Vazquez-Abrams Mar 09 '14 at 08:25
  • @IgnacioVazquez-Abrams It would not matter. – Lone Learner Mar 09 '14 at 08:26
  • @LoneLearner It matters greatly if there is an encoded forward slash character in the path. The WSGI application has no way to tell that the decoded slash was not supposed to be a path delimiter. – Ian Goldby Jun 04 '19 at 11:12
  • @IanGoldby I thought a forward slash would always be a path delimiter irrespective of whether it occurs literally or encoded (`%2F`) in the URL. For example [https://stackoverflow.com/q%2F22280010%2F1175080](https://stackoverflow.com/q%2F22280010%2F1175080) always leads to this question even though the path delimiters are encoded as `%2F` in this URL. The application does not need a way to tell if the decoded slash was not supposed to be a path delimiter because it is always a path delimiter. – Lone Learner Jun 05 '19 at 11:33
  • @LoneLearner I think what you are seeing is the browser decoding the %2F before it uses the URL. I'm not sure why the browser does this, but [RFC 3986](https://tools.ietf.org/html/rfc3986#section-2.2) clearly says that reserved characters such as / *can* be used in the path if they are percent-encoded. See also https://stackoverflow.com/a/38435903 – Ian Goldby Jun 05 '19 at 13:07
  • @IanGoldby That's not what I am seeing. Here's how I verify it. Fire up a netcat listener: `nc -l 8000`. Then in the browser, I visit http://localhost:8000/q%2Fx%2Fy and I find this request reaching netcat: `GET /q%2Fx%2Fy HTTP/1.1`. – Lone Learner Jun 05 '19 at 14:55
  • @LoneLearner You're right - apologies for my wrong guess. I was fooled by the URL shown in the status bar on hover, where the %2F is decoded. The RFC clearly implies %2F should be treated differently to / but it may well be that in practice webservers do not follow this. Certainly if you can avoid ever having to handle a slash character in a URL that isn't a delimiter then you will probably have an easier time. But to repeat my original point, the RFC *does* allow it. – Ian Goldby Jun 06 '19 at 07:21
  • @LoneLearner Here we are: In Apache there is a setting AllowEncodedSlashes that affects the interpretation of %2F either as a path delimiter or as a literal. It seems this setting exists [to protect lame CGI scripts from themselves](https://stackoverflow.com/a/5944638). – Ian Goldby Jun 06 '19 at 07:25

1 Answer

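For the second request, the value of REQUEST_URI, which comes from the actual HTTP request line, would be (if the server makes it available):

REQUEST_URI: '/%3Cfoo%3E'

This is not necessarily the case if you used:

curl "http://localhost:8080/<foo>"

because curl does not appear to percent-encode the path itself: the server log in the question shows "GET /<foo> HTTP/1.1" for that request, so REQUEST_URI would carry the literal /<foo>. Other clients may apply the % escapes before sending.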

REQUEST_URI is not, I believe, covered by any RFC, but many servers provide it. You cannot rely on its presence, though, so don't write your WSGI application to depend on it.
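
To illustrate that advice, here is a minimal sketch of an application-side helper that uses REQUEST_URI only when the server happens to supply it and otherwise falls back to re-quoting PATH_INFO. The helper name `raw_request_path` is hypothetical (it is not part of WSGI or wsgiref), and the fallback is lossy by construction: it can only guess at the original escapes.

```python
from urllib.parse import quote


def raw_request_path(environ):
    """Best-effort guess at the path as it appeared on the request line.

    Hypothetical helper for illustration; not part of WSGI or wsgiref.
    """
    request_uri = environ.get('REQUEST_URI')  # non-standard, may be absent
    if request_uri is not None:
        # Strip the query string and leave the % escapes untouched.
        return request_uri.split('?', 1)[0]
    # Fallback: re-quote the already-decoded path. Under PEP 3333 these
    # strings are latin-1-decoded bytes, so quote with that encoding.
    path = environ.get('SCRIPT_NAME', '') + environ.get('PATH_INFO', '')
    return quote(path, encoding='latin-1')


print(raw_request_path({'REQUEST_URI': '/%3Cfoo%3E?x=1', 'PATH_INFO': '/<foo>'}))
print(raw_request_path({'PATH_INFO': '/<foo>'}))
print(raw_request_path({'PATH_INFO': '/a/b'}))
```

The last call shows the limitation: a client that sent /a%2Fb and one that sent /a/b both produce PATH_INFO '/a/b', so the fallback cannot restore an encoded slash.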

The web server decodes the % escapes in the request path before processing it. The result that ends up in PATH_INFO will thus always be:

PATH_INFO: '/<foo>'
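
As a rough illustration of that decoding step, the sketch below applies urllib.parse.unquote to a few request paths. The function name `decode_path` is made up for this example, and the choice of iso-8859-1 is an assumption based on how PEP 3333 represents strings in Python 3; it is not a copy of any particular server's code.

```python
from urllib.parse import unquote


def decode_path(request_target):
    """Turn a raw request target into a PATH_INFO-style string (sketch)."""
    # Drop the query string, then decode the % escapes in the path.
    path = request_target.split('?', 1)[0]
    return unquote(path, 'iso-8859-1')


for raw in ('/<foo>', '/%3Cfoo%3E', '/a%2Fb', '/a/b'):
    print(raw, '->', decode_path(raw))
```

Both spellings of each path decode to the same string, which is also why an encoded slash (%2F) cannot be told apart from a literal one once the application sees PATH_INFO.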

The decoding is covered by the CGI and related RFCs that WSGI builds on.

See for example:

Graham Dumpleton
  • When I use `curl "http://localhost:8080/<foo>"`, curl is not encoding the URL. Here is the evidence: Run `nc -l 8080` on one terminal. In another, run `curl "http://localhost:8080/<foo>"`. The first terminal shows this request received: `GET /<foo> HTTP/1.1`. – Lone Learner Jun 05 '19 at 15:03