Unicode regex to match a character class of Chinese characters

Question

`^[一二三四五六七]、` doesn't match `一、`.

But `^一、` matches `一、`.

Is my way of specifying a character class of Chinese characters wrong?

I read the regular expression from a file.

Are you specifying the characters in a Unicode string or a byte string? What is the encoding of the file containing your code? – BrenBarn Jun 16 '15 at 02:11
"The encoding of your file?" Do you mean my Python script file? Where do you specify the encoding of a file? – Tim Jun 16 '15 at 02:13

score 3 · Answer 1 · Avinash Raj · 2015-06-16T02:41:45.300

Works for me:

>>> import re
>>> re.match(u'^[一二三四五六七]、', u'一、')
<_sre.SRE_Match object; span=(0, 2), match='一、'>
>>> re.match(u'^[一二三四五六七]、', u'一、').group(0)
'一、'

I think you did not define your regex as a Unicode string.

In Python 3, it would be:

# -*- coding: utf-8 -*-

import re

# Pass the encoding explicitly so the pattern is decoded as UTF-8 text
with open('file', encoding='utf-8') as f:
    reg = f.read().strip()

print(re.match(reg, u'一、').group(0))
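
The diagnosis above (the pattern is not a Unicode string) can be illustrated with a small sketch. It is written for Python 3 and assumes the pattern and text reached `re` as undecoded UTF-8 bytes, which is what a plain `open('myfile').read()` returns on Python 2.7. The variable names are illustrative only.

```python
import re

pattern_text = '^[一二三四五六七]、'
subject_text = '一、'

# Assumption: emulate Python 2.7's undecoded file contents by encoding to UTF-8 bytes.
pattern_bytes = pattern_text.encode('utf-8')
subject_bytes = subject_text.encode('utf-8')

# As bytes, the class [...] is a set of single bytes. It consumes only the
# first of the three bytes of 一, so the following 、 cannot match.
byte_class_match = re.match(pattern_bytes, subject_bytes)

# Without a class, the byte sequences line up one-to-one, so this still matches.
byte_literal_match = re.match('^一、'.encode('utf-8'), subject_bytes)

# As decoded text, the class is a set of whole characters.
text_match = re.match(pattern_text, subject_text)

print(byte_class_match)                # None
print(byte_literal_match is not None)  # True
print(text_match.group(0))             # 一、
```

This reproduces the behaviour in the question: the literal pattern matches, the character class does not. On Python 2.7, decoding the bytes read from the file (for example with `.decode('utf-8')`) gives `re` a Unicode pattern instead.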

edited Jun 16 '15 at 02:41

answered Jun 16 '15 at 02:11


Does adding `u` also work for those cases which do not need `u`? – Tim Jun 16 '15 at 02:12
Also I read the regex from a text file by `myregex=open('myfile').read()`, where `myfile` content is `^[一二三四五六七]、`. How can I specify `u` then? – Tim Jun 16 '15 at 02:19
http://stackoverflow.com/questions/4182603/python-how-to-convert-a-string-to-utf-8 – Avinash Raj Jun 16 '15 at 02:22
It seems the replies there all specify the strings in Python scripts. I would like to know how to treat a regex read from a text file as `u`? – Tim Jun 16 '15 at 02:33
I am in Python 2.7. Both regex and the text are read from (different) text files – Tim Jun 16 '15 at 02:42
Always mention the version of Python you're running. Your question fails to provide these details. – Avinash Raj Jun 16 '15 at 02:43
`u'一、'` seems to be invalid syntax in Python 3 – nhahtdh Jun 16 '15 at 05:29

score 1 · Answer 2 · answered Jun 16 '15 at 03:47

You need to make sure that you read the files using the correct encoding:

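import re

# Read the pattern as text; strip the trailing newline so it is not part of the regex
with open('my-regex-file', encoding='utf-8') as f:
    regex = re.compile(f.read().strip())
with open('my-text-file', encoding='utf-8') as f:
    text = f.read()
if regex.match(text):
    print("It's a match!")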
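
Since the asker is on Python 2.7, where the built-in `open` has no `encoding` parameter, here is a minimal sketch of the same idea using `io.open`, which accepts `encoding` on both Python 2.7 and Python 3. The helper `read_text` and the temporary sample files are hypothetical, created only so the example is self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile


def read_text(path):
    # io.open decodes the file to Unicode text on Python 2.7 and Python 3 alike.
    with io.open(path, encoding='utf-8') as f:
        return f.read().strip()


# Hypothetical sample files standing in for the asker's regex and text files.
tmpdir = tempfile.mkdtemp()
regex_path = os.path.join(tmpdir, 'my-regex-file')
text_path = os.path.join(tmpdir, 'my-text-file')
with io.open(regex_path, 'w', encoding='utf-8') as f:
    f.write(u'^[一二三四五六七]、\n')
with io.open(text_path, 'w', encoding='utf-8') as f:
    f.write(u'一、\n')

regex = re.compile(read_text(regex_path))
text = read_text(text_path)
matched = regex.match(text) is not None
print(matched)
```

Because both the pattern and the text are decoded before they reach `re`, the character class is treated as a set of whole characters, not of individual bytes.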
