In Python 3.3, I want to match the pattern below, but it keeps failing.
摄氏零下253
I used the regex below.
[^\x00-\x47\x58-\x7F]+
Doesn't it exclude all of ASCII except digits?
Depending on what programming language you are using, you could use the following.
[\p{Han}\p{N}]+
\p{Han} matches characters in the Han script.
\p{N} matches any kind of numeric character in any script.
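Since the question is about Python: the built-in re module does not support \p{...} classes (the third-party regex module does). As a rough illustration of what \p{N} covers, the Unicode general category can be checked with the standard unicodedata module. The helper name is_numeric_char is made up for this sketch.

```python
import unicodedata

def is_numeric_char(ch):
    # Hypothetical helper: True when the general category is Nd, Nl or No,
    # i.e. the categories that \p{N} groups together.
    return unicodedata.category(ch).startswith('N')

# '3', ARABIC-INDIC DIGIT THREE, ROMAN NUMERAL EIGHT, VULGAR FRACTION ONE HALF,
# and the Han character for "zero" (U+96F6).
sample = ['3', '\u0663', '\u2167', '\u00bd', '\u96f6']
flags = [is_numeric_char(ch) for ch in sample]
print(flags)
```

Note that the last character, 零, is classed as a letter (Lo) rather than a number, so in 摄氏零下253 it is \p{Han} that picks it up, not \p{N}.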
You are mixing up the decimal and hexadecimal values of the ASCII codes. The \x escape sequence denotes a hexadecimal escape, so you should use the hex value of the ASCII character you need. The digits 0-9 are 48 to 57 in decimal, which is 30 to 39 in hex.
Referring to the ASCII table (http://www.asciitable.com/), the excluded ranges should be 00 to 2F and then 3A to 7F, so your regex should look like this:
[^\x00-\x2F\x3A-\x7F]+
However, the above regex also matches characters other than Chinese ones (in fact, it matches every non-ASCII character, plus the ASCII digits).
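A minimal sketch of the difference in Python, using only the standard re module. The sample strings other than 摄氏零下253 are made up for illustration.

```python
import re

# Original pattern: 47 and 58 are read as hex, so the class actually
# lets through \x48-\x57 ('H' to 'W') and excludes the digits.
original = re.compile(r'[^\x00-\x47\x58-\x7F]+')

# Corrected pattern: \x00-\x2F and \x3A-\x7F exclude all ASCII except '0'-'9'.
corrected = re.compile(r'[^\x00-\x2F\x3A-\x7F]+')

text = '摄氏零下253'
original_match = original.search(text).group(0)
corrected_match = corrected.search('Temp: 摄氏零下253 degrees').group(0)

# The corrected class is still broad: any non-ASCII character passes.
accent_match = corrected.search('cafe\u00e9').group(0)

print(original_match, corrected_match, accent_match)
```

The original pattern stops before the digits, which is why the match appeared to fail.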
I'm not sure what language you are doing this in, but this regex works in PHP using the predefined Unicode scripts:
/(?:[\p{Han}0-9]+)/simu
The same goes for this one, which might be more portable, since not every regex implementation supports the predefined Unicode scripts:
/[\x{4e00}-\x{9fa5}0-9]+/simu
And here is some test code with both regexes in place; comment out one or the other to test:
// Set the test string.
$string = '摄氏零下253';
// Run it through preg_match.
// $regex = "/(?:[\p{Han}0-9]+)/simu";
$regex = "/[\x{4e00}-\x{9fa5}0-9]+/simu";
preg_match($regex, $string, $matches);
// Send a UTF-8 header out so it looks nice.
header('Content-Type: text/html; charset=UTF-8');
// Dump the matches.
echo '<pre>';
print_r($matches);
echo '</pre>';
And here are the results of that script:
Array
(
[0] => 摄氏零下253
)
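For Python, a roughly equivalent sketch of the code point range approach with the standard re module might look like this. It assumes the same U+4E00 to U+9FA5 range as the PHP pattern above, which does not cover every Han character (the extension blocks are left out).

```python
import re

# The \u escapes are resolved by the string literal, so the class contains
# the range U+4E00..U+9FA5 plus the ASCII digits.
pattern = re.compile('[\u4e00-\u9fa50-9]+')

match = pattern.search('摄氏零下253')
matched = match.group(0) if match else None
print(matched)
```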
Some regular expression flavors have extensions such as named character groups.
You can use the following group:
\p{Han} matches the Chinese Han characters.
The regex is then:
[\p{Han}]+[0-9]+
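Unlike a single mixed character class, this pattern requires the Han characters to come first and the digits after them. A small Python sketch of that behaviour, substituting the U+4E00 to U+9FA5 range for \p{Han} (an assumption, since the standard re module has no \p{Han}) and adding capture groups for illustration:

```python
import re

# Han characters followed by ASCII digits, each part captured separately.
pattern = re.compile('([\u4e00-\u9fa5]+)([0-9]+)')

m = pattern.search('摄氏零下253')
han_part, number_part = m.groups()

# Digits before the Han characters do not satisfy the ordering.
reversed_match = pattern.search('253摄氏')

print(han_part, number_part, reversed_match)
```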