0

^[一二三四五六七]、 doesn't match 一、

But ^一、 matches 一、.

Is my way of specifying a character class of Chinese characters wrong?

I read the regular expression from a file.

Raniz
Tim
  • Are you specifying the characters in a unicode string or a byte string? What is the encoding of the file containing your code? – BrenBarn Jun 16 '15 at 02:11
  • "the encoding of your file?" you mean my python script file? Where do you specify the encoding of file? – Tim Jun 16 '15 at 02:13

2 Answers

3

Works for me,

>>> import re
>>> re.match(u'^[一二三四五六七]、', u'一、')
<_sre.SRE_Match object; span=(0, 2), match='一、'>
>>> re.match(u'^[一二三四五六七]、', u'一、').group(0)
'一、'

I think you didn't define your regex as a unicode string.
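To illustrate why the character class fails while the literal `^一、` still works, here is a minimal sketch. It assumes the pattern and text are UTF-8 encoded, and it uses `.encode('utf-8')` to imitate what Python 2 sees when a `str` is never decoded; the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
import re

pattern = u'^[一二三四五六七]、'
text = u'一、'

# Text (unicode) pattern: the class matches one whole character.
text_match = re.match(pattern, text)

# Byte pattern (assumed UTF-8): each character is three bytes, so the class
# becomes a set of single bytes and can only consume the first byte of '一'.
byte_match = re.match(pattern.encode('utf-8'), text.encode('utf-8'))

# Without a character class, the bytes are compared one by one and line up.
literal_byte_match = re.match(u'^一、'.encode('utf-8'), text.encode('utf-8'))

print(text_match is not None)          # True
print(byte_match is not None)          # False
print(literal_byte_match is not None)  # True
```

This reproduces the behaviour described in the question: the class fails on bytes, the literal does not.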

In Python 3, it would be:

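# -*- coding: utf-8 -*-

import re

# Decode the pattern file as UTF-8 so the result does not depend on the
# platform's default encoding.
with open('file', encoding='utf-8') as f:
    reg = f.read().strip()
print(re.match(reg, u'一、').group(0))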
Avinash Raj
  • Does adding `u` also work for those cases which do not need `u`? – Tim Jun 16 '15 at 02:12
  • Also I read the regex from a text file by `myregex=open('myfile').read()`, where `myfile` content is `^[一二三四五六七]、`. How can I specify `u` then? – Tim Jun 16 '15 at 02:19
  • http://stackoverflow.com/questions/4182603/python-how-to-convert-a-string-to-utf-8 – Avinash Raj Jun 16 '15 at 02:22
  • It seems the replies there all specify the strings in Python scripts. I would like to know how to treat a regex read from a text file as `u`? – Tim Jun 16 '15 at 02:33
  • I am in Python 2.7. Both regex and the text are read from (different) text files – Tim Jun 16 '15 at 02:42
  • Always mention the version of Python you're running. Your question doesn't provide these details. – Avinash Raj Jun 16 '15 at 02:43
  • `u'一、'` seems to be invalid syntax in Python 3 – nhahtdh Jun 16 '15 at 05:29
1

You need to make sure that you read the files using the correct encoding:

import re

with open('my-regex-file', encoding='utf-8') as f:
    # strip() drops the trailing newline so it is not part of the pattern
    regex = re.compile(f.read().strip())
with open('my-text-file', encoding='utf-8') as f:
    text = f.read()
if regex.match(text):
    print("It's a match!")
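The built-in `open` in Python 2.7 does not accept an `encoding` argument, so the snippet above only runs on Python 3. Since the asker is on 2.7, a variant using `io.open` (which takes `encoding` on both 2.7 and 3) might look like the sketch below. The helper names `read_text` and `matches_from_files` are hypothetical, the files are assumed to be UTF-8, and the demo writes two temporary files only so the example is self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile


def read_text(path):
    """Read a file as unicode text, assuming it is UTF-8 encoded."""
    with io.open(path, encoding='utf-8') as f:
        return f.read()


def matches_from_files(regex_path, text_path):
    """Return True if the pattern in regex_path matches the start of the text."""
    # strip() removes the trailing newline most editors append to the file.
    pattern = re.compile(read_text(regex_path).strip(), re.UNICODE)
    return pattern.match(read_text(text_path)) is not None


# Demo files with made-up content, created in a temporary directory.
tmp_dir = tempfile.mkdtemp()
regex_path = os.path.join(tmp_dir, 'my-regex-file')
text_path = os.path.join(tmp_dir, 'my-text-file')
with io.open(regex_path, 'w', encoding='utf-8') as f:
    f.write(u'^[一二三四五六七]、\n')
with io.open(text_path, 'w', encoding='utf-8') as f:
    f.write(u'一、 first item\n')

print(matches_from_files(regex_path, text_path))
```

Because both the pattern and the text are decoded before they reach `re`, the character class is compared character by character rather than byte by byte.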
Raniz