Regex to Match "Chinese+Number" pattern in Python

Question

In Python 3.3, I want to match the pattern below, but it keeps failing.

摄氏零下253

I used the regex below.

[^\x00-\x47\x58-\x7F]+

Dosen't it exclude all of ascii except digits?

Okay. I provided an answer that uses PHP for an example but provided two regex examples that you can use or adapt easily. — Giacomo1968, Jul 02 '14 at 05:27

score 9 · Answer 1 · answered Jul 02 '14 at 03:46

9

Depending on what programming language you are using, you could use the following.

[\p{Han}\p{N}]+

\p{Han} matches characters in the Han script.
\p{N} matches any kind of numeric character in any script.

Live Demo

answered Jul 02 '14 at 03:46

hwnd

69,796
4
95
132

`\p{N}` doesn't work in Java, any idea what the equivalent is? – Hasen Jul 20 '22 at 11:26

score 3 · Accepted Answer · answered Jul 02 '14 at 03:40

3

You are mixing up the decimal and hexadecimal values for ASCII numbers. The \x escape sequence denotes a hexadecimal escape, for which you should use the hex value of the ASCII character you need.

Referring to the ASCII table (http://www.asciitable.com/), the range should be 0 to 2F and then 3A to 7F, and your regex should look like this:

[^\x00-\x2F\x3A-\x7F]+

However, the above regex does include characters besides Chinese ones (in fact, it includes everything except the 127 ASCII characters minus the digits).

answered Jul 02 '14 at 03:40

spinningarrow

2,406
22
32

Correct assessment, but the thing with regex it’s n to clearcut when jumping between different languages when you get into the multibyte character world. – Giacomo1968 Jul 02 '14 at 03:48
That's true, but I wanted to point it out to the OP in case it leads to errors in the future. – spinningarrow Jul 02 '14 at 06:48

Giacomo1968 · Answer 3 · 2014-07-02T16:06:14.630

Unsure what language you would be doing this in, but this regex works in PHP when using predefined Unicode scripts:

/(?:[\p{Han}0-9]+)/simu

Ditto with this which might be more portable since not all implementations of regex have the predefined Unicode scripts set:

/[\x{4e00}-\x{9fa5}0-9]+/simu

And here is some test code with both regex in place; comment one or the other out to test:

// Set the test string.
$string = '摄氏零下253';

// Run it through preg_match.
// $regex = "/(?:[\p{Han}0-9]+)/simu";
$regex = "/[\x{4e00}-\x{9fa5}0-9]+/simu";
preg_match($regex, $string, $matches);

// Send a UTF-8 header out so it looks nice.
header('Content-Type: text/html; charset=UTF-8');

// Dump the matches.
echo '<pre>';
print_r($matches);
echo '</pre>';

And here are the results of that script:

Array
(
    [0] => 摄氏零下253
)

score 0 · Answer 4 · answered Jul 02 '14 at 04:10

0

There are some extensions to regular expressions like named characters groups.

You can the following group:

\p{Han} the chinese Han characters.

The regex is then:

[\p{Han}]+[0-9]+

answered Jul 02 '14 at 04:10

Willem Van Onsem

443,496
30
428
555

Regex to Match "Chinese+Number" pattern in Python

4 Answers4

Linked