3

In Python 3.3, I want to match the pattern below, but it keeps failing.

摄氏零下253

I used the regex below.

[^\x00-\x47\x58-\x7F]+

Doesn't it exclude all of ASCII except the digits?

Giacomo1968
MJ Park

4 Answers

9

Depending on what programming language you are using, you could use the following.

[\p{Han}\p{N}]+

\p{Han} matches characters in the Han script.
\p{N} matches any kind of numeric character in any script.
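Since the question is about Python: the standard library `re` module does not support `\p{...}` classes (the third-party `regex` package does). The sketch below is only a rough stdlib approximation of the same idea. The helper names `is_han_or_numeric` and `all_han_or_numeric` are hypothetical, and approximating `\p{Han}` by character name is an assumption that misses radicals and compatibility ideographs.

```python
import unicodedata


def is_han_or_numeric(ch):
    # Rough stand-in for the character class [\p{Han}\p{N}].
    # \p{N} corresponds to the general categories Nd, Nl and No.
    if unicodedata.category(ch).startswith('N'):
        return True
    # Assumption: approximate \p{Han} by the Unicode character name.
    return unicodedata.name(ch, '').startswith('CJK UNIFIED IDEOGRAPH')


def all_han_or_numeric(text):
    # True if text is non-empty and every character passes the check.
    return bool(text) and all(is_han_or_numeric(ch) for ch in text)


print(all_han_or_numeric('摄氏零下253'))  # True
print(all_han_or_numeric('摄氏 -253'))    # False (space and hyphen fail)
```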


hwnd
3

You are mixing up the decimal and hexadecimal values for ASCII numbers. The \x escape sequence denotes a hexadecimal escape, for which you should use the hex value of the ASCII character you need.

Referring to the ASCII table (http://www.asciitable.com/), the range should be 0 to 2F and then 3A to 7F, and your regex should look like this:

[^\x00-\x2F\x3A-\x7F]+

However, the above regex does match characters besides Chinese ones (in fact, it matches every character outside the 128-character ASCII range, plus the ASCII digits).
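A small Python sketch of the point above, using the sample string from the question. It first shows where the digits sit in ASCII, then compares the original and corrected patterns. The variable names `original` and `corrected` are only illustrative.

```python
import re

# The digits '0'..'9' sit at decimal 48..57, which is hex 0x30..0x39.
print(ord('0'), hex(ord('0')))  # 48 0x30
print(ord('9'), hex(ord('9')))  # 57 0x39

text = '摄氏零下253'

# Pattern from the question: decimal 47 and 58 written as if they were hex,
# so the digits fall inside the excluded range \x00-\x47.
original = re.compile(r'[^\x00-\x47\x58-\x7F]+')
# Corrected pattern from this answer.
corrected = re.compile(r'[^\x00-\x2F\x3A-\x7F]+')

print(original.match(text).group())   # 摄氏零下
print(corrected.match(text).group())  # 摄氏零下253

# Caveat: any non-ASCII character is accepted, not only Chinese ones.
print(corrected.match('é7').group())  # é7
```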

spinningarrow
  • Correct assessment, but the thing with regex is that it's not so clear-cut when jumping between different languages once you get into the multibyte character world. – Giacomo1968 Jul 02 '14 at 03:48
  • That's true, but I wanted to point it out to the OP in case it leads to errors in the future. – spinningarrow Jul 02 '14 at 06:48
1

Unsure what language you would be doing this in, but this regex works in PHP when using predefined Unicode scripts:

/(?:[\p{Han}0-9]+)/simu

Ditto for this one, which might be more portable, since not all regex implementations support the predefined Unicode script classes:

/[\x{4e00}-\x{9fa5}0-9]+/simu

And here is some test code with both regexes in place; comment one or the other out to test:

// Set the test string.
$string = '摄氏零下253';

// Run it through preg_match.
// Single quotes keep the backslashes literal for the regex engine.
// $regex = '/(?:[\p{Han}0-9]+)/simu';
$regex = '/[\x{4e00}-\x{9fa5}0-9]+/simu';
preg_match($regex, $string, $matches);

// Send a UTF-8 header out so it looks nice.
header('Content-Type: text/html; charset=UTF-8');

// Dump the matches.
echo '<pre>';
print_r($matches);
echo '</pre>';

And here are the results of that script:

Array
(
    [0] => 摄氏零下253
)
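For the Python 3.3 setting of the question, the explicit-range pattern can be ported to the standard `re` module, which accepts `\uXXXX` escapes in patterns from 3.3 onward. This is a minimal sketch rather than a full equivalent: the PHP `i`, `m` and `s` flags are dropped because this pattern contains no letters, anchors, or dots, and the sample input strings are made up for illustration.

```python
import re

# Same character class as the PHP pattern: U+4E00..U+9FA5 plus ASCII digits.
pattern = re.compile(r'[\u4e00-\u9fa50-9]+')

# search() finds the first run of Han characters and digits.
match = pattern.search('temp: 摄氏零下253 C')
print(match.group())  # 摄氏零下253

# findall() returns every such run in the input.
print(pattern.findall('abc 摄氏零下253 xyz 42'))  # ['摄氏零下253', '42']
```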
Giacomo1968
0

There are some extensions to regular expressions, such as named character groups.

You can use the following group:

\p{Han} matches the Chinese Han characters.

The regex is then:

[\p{Han}]+[0-9]+
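Note that this pattern requires the Han characters to come first and the digits second, unlike a single character class that allows any mix. A hedged Python sketch of the difference follows. It assumes the explicit range U+4E00..U+9FA5 as a stand-in for `\p{Han}`, because the standard `re` module has no script classes, and the names `ordered` and `mixed` are illustrative.

```python
import re

# Assumption: [\u4e00-\u9fa5] stands in for \p{Han}.
ordered = re.compile(r'[\u4e00-\u9fa5]+[0-9]+')  # Han run, then digit run
mixed = re.compile(r'[\u4e00-\u9fa50-9]+')       # any mix of the two

print(ordered.match('摄氏零下253').group())  # 摄氏零下253
print(ordered.search('253摄氏'))             # None: digits come first
print(mixed.match('253摄氏').group())        # 253摄氏
```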
Willem Van Onsem