Does Nginx support raw unicode in paths?

Question

Browsers URL-encode Unicode characters as %XX sequences by default.

However, I can make a request via curl to http://localhost:8080/与 and nginx sees the path as "与". How is this possible? Does nginx allow arbitrary Unicode in its path, then?

For example, with this config I can set an additional header to see what nginx saw:

location ~* "(*UTF8)([^\w/\.\-\\% ])" {
        add_header "response" $1;
        return 200;
}

Request:

* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /与 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8080
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Tue, 20 Jan 2015 21:44:51 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< response: 与                                        <--- SEE THIS?
< 
* Connection #0 to host localhost left intact

However, when I remove the UTF8 marker then the header contains "?" as if nginx can't understand the character (or is only reading the first byte).

location ~* "([^\w/\.\-\\% ])" {
        add_header "response" $1;
        return 200;
}

Request:

* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /与 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8080
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Tue, 20 Jan 2015 21:45:35 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< response: ?
< 
* Connection #0 to host localhost left intact

Note: Changing this non-utf-8 regex to capture one-or-more ([^...]+) also results in the response: 与 header being sent (byte vs multibyte strings?)

Logging either regex match to a file results in a request entry like:

GET /\xE4\xB8\x8E HTTP/1.1

Related http://stackoverflow.com/questions/22357509/can-urls-have-utf-8-characters — Xeoncross, Jan 20 '15 at 22:27

score 18 · Accepted Answer · answered Jan 23 '15 at 17:37

Apart from the regexes and terminal configuration, this doesn't have anything to do with Unicode. The short answer to your question is: nginx doesn't care about Unicode encodings but it does accept non-ASCII bytes in URLs.

Here's the long answer that explains what you're seeing. If you enter the command

curl http://localhost:8080/与

and your terminal uses UTF-8 as encoding, it will encode the character 与 (U+4E0E) into the three-byte UTF-8 sequence

0xE4 0xB8 0x8E

curl apparently accepts non-ASCII bytes in URLs, although they're technically illegal. It will then send an HTTP request with these non-ASCII bytes. Since there is no default way to display these bytes, I'll use C-style hex escapes like \x00 from now on to represent them. So the request line sent by curl looks like:

GET /\xE4\xB8\x8E HTTP/1.1

That's three bytes after the first /. If the terminal on which you view your logs also supports UTF-8, this will be displayed on your screen as

GET /与 HTTP/1.1

But this does not mean that there are Unicode characters in your HTTP request. On the HTTP level, we only deal with bytes.
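To make the byte-level view concrete, here is a small Python sketch (Python is only used for illustration; it is not part of curl or nginx). It builds the raw request line that curl sends and, for comparison, the percent-encoded form a browser would typically send. The variable names are made up for this example.

```python
from urllib.parse import quote

path = "/与"                      # what you type in a UTF-8 terminal

# The terminal hands curl the UTF-8 bytes of the character U+4E0E.
raw_path = path.encode("utf-8")   # b'/\xe4\xb8\x8e'

# Request line with raw non-ASCII bytes, as curl sends it here.
raw_request_line = b"GET " + raw_path + b" HTTP/1.1"

# Percent-encoded form: the same three bytes, written as ASCII %XX triplets.
encoded_path = quote(path)        # '/%E4%B8%8E'
encoded_request_line = f"GET {encoded_path} HTTP/1.1".encode("ascii")

print(raw_request_line)
print(encoded_request_line)
```

Either way the server receives a sequence of bytes; whether those bytes mean "与" is a matter of interpretation on each side.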

nginx also seems to happily accept non-ASCII bytes in URLs. Then the following regex

(*UTF8)([^\w/\.\-\\% ])

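working in UTF-8 mode, treats the byte sequence \xE4\xB8\x8E as the single character 与. That character is not matched by \w or by anything else in the negated class, so the class matches it, $1 captures all three bytes, and the header will be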

response: \xE4\xB8\x8E

which your terminal displays as

response: 与

On the other hand, the regex

([^\w/\.\-\\% ])

works directly on bytes, so the capture group can hold only a single byte. The first byte of the sequence \xE4\xB8\x8E is not matched by \w (with PCRE's default character tables, only ASCII letters, digits and the underscore are word characters), so the negated class matches it and the header will be:

response: \xE4

which your terminal decides to display as

response: ?

because the byte \xE4 followed by a newline is invalid UTF-8. The regex ([^\w/\.\-\\% ]+) matches the whole byte sequence, so it produces the same result as the UTF-8 regex.
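The difference between byte mode and character mode can be reproduced outside nginx. The following is only an analogy using Python's re module, not PCRE: a bytes pattern stands in for the regex without (*UTF8), and a str pattern with re.ASCII stands in for (*UTF8) without Unicode properties for \w. That mapping is my assumption, not something taken from the nginx or PCRE sources.

```python
import re

raw_path = b"/\xe4\xb8\x8e"          # the bytes nginx receives for /与

# Byte mode: the class consumes exactly one byte.
byte_single = re.search(rb"([^\w/\.\-\\% ])", raw_path).group(1)

# Byte mode, one-or-more: the class consumes all three bytes.
byte_run = re.search(rb"([^\w/\.\-\\% ]+)", raw_path).group(1)

# Character mode: the three bytes are decoded first and form one character.
# re.ASCII keeps \w limited to ASCII, so the negated class matches the character.
char_mode = re.search(r"([^\w/\.\-\\% ])", raw_path.decode("utf-8"), re.ASCII).group(1)

# What a UTF-8 terminal makes of the lone first byte followed by a newline:
shown = (byte_single + b"\n").decode("utf-8", errors="replace")

print(byte_single)                   # one byte
print(byte_run)                      # three bytes
print(char_mode.encode("utf-8"))     # the same three bytes
print(repr(shown))                   # replacement character, then newline
```

The lone \xE4 is the start of a three-byte sequence that never completes, which is why a UTF-8 terminal shows a placeholder such as "?" instead of a character.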

If you see something like

GET /\xE4\xB8\x8E HTTP/1.1

in your logs, that's because the authors of the logging code decided to use escape sequences for non-ASCII bytes. In general, this is a good idea because it always produces the same output regardless of terminal configuration and really shows what's going on: your HTTP request simply contains non-ASCII bytes.
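As a rough illustration of that kind of escaping, here is a minimal Python sketch. The function name escape_for_log is hypothetical, and the exact set of escaped bytes (everything outside printable ASCII, plus the double quote and the backslash) is my assumption about a sensible default, not code taken from nginx.

```python
def escape_for_log(data: bytes) -> str:
    """Render bytes as ASCII text, writing unsafe bytes as \\xXX escapes."""
    out = []
    for b in data:
        # Keep printable ASCII, except '"' (0x22) and '\' (0x5C),
        # which would be ambiguous inside a quoted log field.
        if 0x20 <= b <= 0x7E and b not in (0x22, 0x5C):
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02X}")
    return "".join(out)


print(escape_for_log(b"GET /\xe4\xb8\x8e HTTP/1.1"))
```

The output is pure ASCII, so it looks the same in any terminal or editor, whatever encoding it assumes.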

It's interesting that the "error.log" saved it as `[error] 4107#0: *1 open() "/usr/share/nginx/html/ケ" failed (2: No such file or directory), client: 10.0.2.2, server: localhost, request: "GET /ケ HTTP/1.1", host: "localhost:2200"` while the access log saved it as `GET /\xE4\xB8\x8E`. — Xeoncross, Jan 24 '15 at 00:48
error.log and access.log are written by different modules and for different purposes. So it's fine. — Alexey Ten, Jan 28 '15 at 14:32

score 3 · Answer 2 · edited May 23 '17 at 10:30

Doesn't your own testing already seem to answer your question?

Yes, nginx does support Unicode in paths.

As a point of discussion, nginx will normalise URLs prior to location matching, as pointed out in the documentation at http://nginx.org/r/location. This is why different "weird" requests (like those containing ../, or those encoding ? as %3F, thus making it part of the filename instead of signifying the parameters known as $args) may still end up being served by a single location that does not look like a one-to-one match to the naked eye.

This normalisation may also explain why the "same" string appears differently within access_log (pre-normalised) vs. error_log (normalised).
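To show what such normalisation roughly involves, here is a Python sketch of the steps the location documentation lists: decoding %XX sequences, resolving "." and ".." components, and merging adjacent slashes. The function name normalise_path is hypothetical and this is only an approximation for illustration; nginx's real implementation differs in edge cases (for example, trailing slashes).

```python
from urllib.parse import unquote_to_bytes


def normalise_path(request_uri: bytes) -> bytes:
    """Approximate how a raw request URI becomes a normalised path."""
    # The query string is split off before decoding, so an encoded
    # %3F ends up inside the path instead of starting the arguments.
    path = request_uri.split(b"?", 1)[0]
    path = unquote_to_bytes(path)          # decode %XX into raw bytes

    segments = []
    for seg in path.split(b"/"):
        if seg in (b"", b"."):             # empty segment = duplicate slash
            continue
        if seg == b"..":
            if segments:
                segments.pop()             # step up one directory
            continue
        segments.append(seg)
    return b"/" + b"/".join(segments)


print(normalise_path(b"/a/../%E4%B8%8E"))  # raw bytes of the character
print(normalise_path(b"/x%3Fy?z=1"))       # the '?' is part of the path
print(normalise_path(b"/a//b/./c"))
```

The input corresponds to what the answer calls pre-normalised ($request_uri) and the return value to the normalised form ($uri).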

answered Jan 26 '15 at 02:40 by cnst

+1 for more parts of the puzzle. You say that `access_log` is "pre-normalized" - yet you say that `nginx will normalise URLs prior to location matching` which can happen before access_log is used (since access_log is defined inside a location). – Xeoncross Jan 26 '15 at 03:13
@Xeoncross, yes, that's exactly what I said; it doesn't matter where `access_log` is used; if you look further into the docs, you'll notice that `$uri` is normalised (and can also be `rewrite`'en), whereas `$request_uri` is not (it's supposed to be whatever comes in), hence, there's no contradiction in my answer or the expected behaviour – cnst Jan 26 '15 at 03:43
-1 for "nginx does support Unicode in paths.". No, nginx operates in raw byte mode, and the concept of unicode code points is not applicable in this context. In other words: you should be sure about what "Unicode" means, exactly. – Dr. Jan-Philip Gehrcke Jan 26 '15 at 14:28
