Browsers URL-encode Unicode characters as %XX escapes by default.
However, I can make a request via curl to http://localhost:8080/与 and nginx sees the path as "与". How is this possible? Does nginx allow arbitrary Unicode in its path, then?
For example, with this config I can set an additional header to see what nginx saw:
location ~* "(*UTF8)([^\w/\.\-\\% ])" {
    add_header "response" $1;
    return 200;
}
Request:
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /与 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Tue, 20 Jan 2015 21:44:51 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< response: 与 <--- SEE THIS?
<
* Connection #0 to host localhost left intact
However, when I remove the (*UTF8) marker, the header contains "?", as if nginx can't understand the character (or is only reading its first byte).
location ~* "([^\w/\.\-\\% ])" {
    add_header "response" $1;
    return 200;
}
Request:
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /与 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Tue, 20 Jan 2015 21:45:35 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< response: ?
<
* Connection #0 to host localhost left intact
Note: Changing this non-UTF-8 regex to capture one or more characters, ([^...]+), also results in the "response: 与" header being sent (byte vs. multibyte strings?).
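To illustrate my byte vs. multibyte guess, here is a minimal Python sketch. It is not nginx or PCRE, only an analogy: I am assuming that a regex without (*UTF8) walks the path one byte at a time, and that with (*UTF8) it walks it one code point at a time. The re.ASCII flag is there to approximate PCRE's default ASCII-only \w, and the variable names are my own.

```python
import re

path = "/与"
single = r"([^\w/.\-\\% ])"    # the character class from the config above
multi = r"([^\w/.\-\\% ]+)"    # the same class, one or more times

# Code-point matching (rough analogue of the (*UTF8) pattern).
# re.ASCII keeps \w limited to [A-Za-z0-9_], so 与 falls outside it.
text_match = re.search(single, path, re.ASCII)

# Byte matching (rough analogue of the pattern without (*UTF8)).
raw = path.encode("utf-8")                      # b'/\xe4\xb8\x8e'
byte_match = re.search(single.encode("ascii"), raw)
multi_match = re.search(multi.encode("ascii"), raw)

print(text_match.group(1))      # the whole character
print(byte_match.group(1))      # only the first byte, b'\xe4'
print(byte_match.group(1).decode("utf-8", errors="replace"))  # not valid UTF-8 alone
print(multi_match.group(1).decode("utf-8"))     # all three bytes, decodes cleanly
```

If nginx behaves the same way, the single-byte capture would put a lone 0xE4 in the header, which a terminal could plausibly render as "?", while the one-or-more capture would happen to collect all three bytes of the character.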
Logging either regex match to a file results in a request entry like:
GET /\xE4\xB8\x8E HTTP/1.1
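For reference, the logged bytes are the UTF-8 encoding of 与, sent raw instead of percent-encoded. A small Python sketch of the two forms, assuming the terminal that curl ran in uses UTF-8 (the variable names are only for illustration):

```python
from urllib.parse import quote, unquote

path = "/与"

# Raw UTF-8 bytes: what curl appears to put on the wire when the path is not escaped.
raw_bytes = path.encode("utf-8")

# Percent-encoded form: what a browser would typically send for the same path.
percent_encoded = quote(path)

print(raw_bytes)                         # b'/\xe4\xb8\x8e'
print(percent_encoded)                   # /%E4%B8%8E
print(unquote(percent_encoded) == path)  # True
```

Both forms carry the same three bytes (E4 B8 8E); the only difference is whether they are escaped in the request line.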