
I’m trying to detect URLs containing Chinese characters in the query string.

I tested with a single character (also with (*UTF8)), but it didn’t work:

# Doesn't work
if ( $args ~ '址' ) {
    return 404;
}

It turned out that this is because nginx "sees" the query string URL-encoded, so $args contains the percent-encoded UTF-8 bytes of 址 (E5 9D 80) rather than the character itself:

# Works
if ( $args ~ '%E5%9D%80' ) {
    return 404;
}

The above works, but it matches only this specific character. I URL-encoded a bunch of other characters, and the only pattern I could see is that they all consist of three percent-encoded bytes, with the first being %E followed by a digit. But that’s a very loose pattern, and it probably matches non-Chinese characters as well.
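For illustration, the tightest version of that idea I could come up with is matching the percent-encoded UTF-8 byte range of the main CJK block directly. This is only a rough sketch: it covers U+4000–U+9FFF, which contains the whole CJK Unified Ideographs block (U+4E00–U+9FFF) but also part of Extension A and a few non-Han code points, and it ignores the other CJK extension blocks entirely:

# Rough sketch: UTF-8 for U+4000–U+9FFF runs from E4 80 80 to E9 BF BF,
# so match that range in its percent-encoded form; ~* makes the match
# case-insensitive in case a client sends lowercase hex (%e5%9d%80)
if ( $args ~* '%E[4-9]%[89AB][0-9A-F]%[89AB][0-9A-F]' ) {
    return 404;
}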

Is there some clean way to do this without generating a monstrous regex from the URL-encodings of all Chinese characters?

bfontaine
- [This answer](https://stackoverflow.com/a/1366113/898478) has the Unicode range for Chinese characters, maybe it can be helpful to find a pattern. – m0skit0 Apr 03 '23 at 11:09
- Based on [this answer](https://stackoverflow.com/a/54319862/653182), perhaps you could use [openresty](https://github.com/openresty/openresty) and then [ngx.escape_uri](https://github.com/openresty/lua-nginx-module#ngxescape_uri) to get the decoded query string. Then, to match the chinese symbols, you could use [unicode scripts or blocks](https://www.regular-expressions.info/unicode.html) to match them with `\p{Han}` or what you actually have to match. Tested here: https://regex101.com/r/5SDIpT/1 – Patrick Janser Apr 03 '23 at 14:04
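A minimal, untested sketch of what the second comment suggests, assuming OpenResty (or nginx built with lua-nginx-module) and a PCRE library compiled with Unicode property support, so that `\p{Han}` is available. (To decode the query string, the relevant function is `ngx.unescape_uri` rather than `ngx.escape_uri`.)

location / {
    access_by_lua_block {
        -- $args is still percent-encoded here, so decode it first
        local decoded = ngx.unescape_uri(ngx.var.args or "")

        -- \p{Han} matches any character in the Han (Chinese) script;
        -- the "u" option enables UTF-8 mode
        if ngx.re.find(decoded, "\\p{Han}", "u") then
            return ngx.exit(ngx.HTTP_NOT_FOUND)
        end
    }
}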
