Reading Regex Character Class Specifier from File

Question

I am reading regular expressions from a file and have generally had no problems until this one came along:

^X.{0,2}[\u2E80-\u9FFF]  # \u2E80-\u9FFF matches most Chinese and Japanese characters

The regex works fine when compiled internally:

import re

p = re.compile(u'^X.{0,2}[\u2E80-\u9FFF]', re.IGNORECASE | re.UNICODE)
print p.search(u'XFlowers for you')
>> None
print p.search(u'X桜桜桜桜')
>> <match object>

but the character range specifier is apparently garbled in the import process as it matches just about anything starting with X thereafter:

f = codecs.open(filename, "r", "utf-8")
lines = f.read().splitlines()
ignorePatterns = FileHelper.fileToList(ignoreFile)
patternList = [re.compile(x, re.IGNORECASE | re.UNICODE) for x in ignorePatterns]

for name in [u'XFlowers for you', u'X桜桜桜桜']:
    for pattern in patternList:
        print pattern.search(name)

This will match both strings.

Anyone know how to solve this one? Thanks!

Just a guess, but i think the encoding on the file is wrong, try reading it as unicode or as ascii with the escape sequence — user230910, Jan 06 '15 at 09:54
I thought the 'utf-8' specifier forced interpretation as unicode. Is there another facet to this? — Toaster, Jan 06 '15 at 10:15

score 3 · Accepted Answer · edited May 23 '17 at 10:26

The problem lies here:

>>> u'^X.{0,2}[\u2E80-\u9FFF]'
u'^X.{0,2}[\u2e80-\u9fff]'

vs

>>> '^X.{0,2}[\u2E80-\u9FFF]'
'^X.{0,2}[\\u2E80-\\u9FFF]'

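Notice the difference? The first example gives you a Unicode string with actual Unicode characters (that are only displayed as escape sequences). The second gives you a non-Unicode string containing literal backslashes, and a character class that no longer means what you intended: Python 2's re module has no \u escape, so it reads \u as a plain u and 0-\u as the range from 0 to u, which covers the digits and most ASCII letters.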
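To make the effect of the un-decoded pattern concrete, here is a minimal sketch. It is written for Python 3, and it rests on one assumption: that Python 2's re treats the unknown escape \u inside a character class as a literal u. Under that assumption the class read from the file behaves like `[u2E80-u9FFF]`, which is spelled out by hand below. The names `misread` and `intended` are only illustrative.

```python
import re

# Hypothetical stand-in: "[u2E80-u9FFF]" spells out how the class is read
# once "\u" is taken as a plain "u" (assumed Python 2 behaviour).
misread = re.compile(r'^X.{0,2}[u2E80-u9FFF]', re.IGNORECASE)

# The intended class: the string literal turns \u2E80 and \u9FFF into
# real characters before re ever sees the pattern.
intended = re.compile('^X.{0,2}[\u2E80-\u9FFF]', re.IGNORECASE)

print(bool(misread.search('XFlowers for you')))   # True
print(bool(intended.search('XFlowers for you')))  # False
print(bool(intended.search('X桜桜桜桜')))          # True
```

The range `0-u` runs from U+0030 to U+0075, so it takes in the digits, the uppercase letters, and the lowercase letters up to u. That is why plain English text after the X satisfies the misread class.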

When you read the expression from a file, you get the second variant: the \uXXXX sequences arrive as literal backslashes and letters. You need to turn them into the actual characters - either by saving the file as Unicode and using actual Unicode characters, not Python escape sequences, or by keeping everything as it is and using the helper function from this answer,

import re

def unicode_unescape(s):
    """
    Turn \uxxxx escapes into actual unicode characters
    """
    def unescape_one_match(matchObj):
        escape_seq = matchObj.group(0)
        return escape_seq.decode('unicode_escape')
    return re.sub(r"\\u[0-9a-fA-F]{4}", unescape_one_match, s)

you can do

>>> unicode_unescape('^X.{0,2}[\u2E80-\u9FFF]')
u'^X.{0,2}[\u2e80-\u9fff]'

or, in context:

f = codecs.open(filename, "r", "utf-8")
lines = f.read().splitlines()
ignorePatterns = FileHelper.fileToList(ignoreFile)
patternList = [re.compile(unicode_unescape(x), re.IGNORECASE | re.UNICODE) for x in ignorePatterns]

for name in [u'XFlowers for you', u'X桜桜桜桜']:
    for pattern in patternList:
        print pattern.search(name)
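The helper above is Python 2 code (it relies on `str.decode`). For readers on Python 3, a rough sketch of the same idea follows. It is not part of the original answer: it swaps `decode('unicode_escape')` for `chr(int(..., 16))`, and the variable `raw` simply stands in for one line read from the pattern file.

```python
import re

def unicode_unescape(s):
    r"""Turn literal \uXXXX escapes in s into the characters they name."""
    def unescape_one_match(match):
        # group(1) holds the four hex digits that follow the backslash-u
        return chr(int(match.group(1), 16))
    return re.sub(r"\\u([0-9a-fA-F]{4})", unescape_one_match, s)

# Stand-in for a line read from the file: the backslashes are literal text.
raw = r'^X.{0,2}[\u2E80-\u9FFF]'
pattern = re.compile(unicode_unescape(raw), re.IGNORECASE)

print(pattern.search('XFlowers for you'))         # None
print(pattern.search('X桜桜桜桜') is not None)     # True
```

Note that from Python 3.3 on, the re module itself understands \uXXXX inside patterns, so on those versions the extra unescaping step may not be needed at all.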

P.S.: Another variant, if you are bold enough to do it (not sure about the security implications), is `eval('u"' + pattern + '"')`. — Tomalak, Jan 06 '15 at 10:22
Thank you, this was a big help. I kept finding postings saying that everything is AOK as Python does the right thing, which is usually the case. I just couldn't find the unescape means. — Toaster, Jan 06 '15 at 10:24
The Python interpreter parses Unicode string literals and turns `\uXXXX` references into actual Unicode characters in memory before the program runs. When you read a file, that step does not happen, naturally. Instead you simply get strings with backslashes in them, which of course is not what you want. That's also why using `eval()` works. — Tomalak, Jan 06 '15 at 10:26
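The difference between a parsed literal and text read from a file can be seen without any file at all. In this small Python 3 sketch, a raw string stands in for what a file read returns; the names `literal` and `from_file` are illustrative only.

```python
# The parser decodes the escape in an ordinary literal: one character.
literal = '\u2E80'

# A raw string keeps the backslash, like text read from a file: six characters.
from_file = r'\u2E80'

print(len(literal), len(from_file))   # 1 6

# Decoding the escape by hand gives the same result the parser produced.
decoded = from_file.encode('ascii').decode('unicode_escape')
print(decoded == literal)             # True
```

This decodes only the escape that is known to be there, which avoids running arbitrary text from the file through `eval()`.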

score -1 · Answer 2 · answered Jan 06 '15 at 10:03

If only English letters and numerals should be matched, and not non-ASCII characters, try this regex - "\b^X[\u0000-\u007F]+\b"

It will match only "XFlowers for you"

Hope it will help.

Thanks.

SasiRSK

That's the opposite of what I am trying to do. – Toaster Jan 06 '15 at 10:06
If so, you can use the negation (caret) symbol: \b^X[^\u0000-\u007F]+\b – SasiRSK Jan 06 '15 at 10:12
This is going down the wrong road, I'm pretty sure it's a character/interpretation kind of thing – user230910 Jan 06 '15 at 10:18
