
Python treats \uxxxx as a unicode character escape inside a unicode string literal (e.g. u"\u2014" is interpreted as the Unicode character U+2014). But I just discovered (Python 2.7) that the standard re module doesn't treat \uxxxx in a pattern as a unicode character. Example:

import re

codepoint = 2014 # Say I got this dynamically from somewhere

test = u"This string ends with \u2014"
pattern = r"\u%s$" % codepoint
assert pattern[-5:] == "2014$" # Ends with an escape sequence for U+2014
assert re.search(pattern, test) is not None # Failure -- No match (bad)
assert re.search(pattern, "u2014") is not None # Success -- This matches (bad)

Obviously if you are able to specify your regex pattern as a string literal, then you can have the same effect as if the regex engine itself understood \uxxxx escapes:

test = u"This string ends with \u2014"
pattern = u"\u2014$"
assert(pattern[:-1] == u"\u2014") # Ends with actual unicode char U+2014
assert(re.search(pattern, test) != None)

But what if you need to construct your pattern dynamically?

Chris
  • You are creating the string `'\u%s'` first, then interpolating the codepoint, so it is *not* interpreted as `\u....` first. That is *expected behaviour*. Use `u'%s' % unichr(codepoint)` instead. – Martijn Pieters May 14 '13 at 11:16

2 Answers


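Use the unichr() function to create a unicode character from an integer codepoint:

pattern = u"%s$" % unichr(codepoint)

Note that codepoint must be the integer 0x2014 here; the decimal 2014 used in the question would produce U+07DE instead.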
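If the codepoint arrives as hex digits in text form, a minimal sketch along the following lines might help. The helper name `pattern_ending_with` is hypothetical, and the `unichr` fallback is only an assumption added so that the same snippet also runs on Python 3. Wrapping the character in `re.escape` is an extra precaution, not part of the answer above, for codepoints that happen to be regex metacharacters.

```python
import re

try:
    unichr  # Python 2 builtin
except NameError:
    unichr = chr  # Assumption: on Python 3, chr() plays the same role

def pattern_ending_with(codepoint):
    # Hypothetical helper: regex for "string ends with this codepoint".
    # re.escape guards against metacharacters such as 0x24 ('$').
    return re.escape(unichr(codepoint)) + u"$"

codepoint = int("2014", 16)  # hex digits as text -> the integer 0x2014
test = u"This string ends with \u2014"
matched = re.search(pattern_ending_with(codepoint), test) is not None
```

The conversion with `int(..., 16)` is the step that the `r"\u%s"` interpolation in the question skips: the regex engine never sees a character, only the letter u followed by digits.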
Martijn Pieters
  • This is a good solution to my example. But it also makes me realize that my example did not exemplify what I was really hoping to ask about. I was less concerned with injecting a single codepoint into a string of known form, and more concerned with how to deal with an unspecified number of \u escapes inside an arbitrary string. That's the direction I was trying to go in with my own answer -- though perhaps I should have used unichr as part of that. – Chris May 14 '13 at 11:29
  • @Chris: I covered replacing **just** `\uxxxx` escapes using a regular expression in [this previous answer](http://stackoverflow.com/questions/14367369/unescape-unicode-escapes-but-not-carriage-returns-and-line-feeds-in-python/14367455#14367455). – Martijn Pieters May 14 '13 at 14:22
  • what does the `"%s$"` mean? – alvas Oct 17 '13 at 12:31
  • @alvas: `%s` is a placeholder for string interpolation; it is replaced by the output of the expression `unichr(codepoint)`. `$` is a regular expression meta character meaning 'match at the end of a line'. – Martijn Pieters Oct 17 '13 at 12:33

One possibility is, rather than calling re methods directly, to wrap them in something that understands \u escapes on their behalf. Something like this:

import re

def my_re_search(pattern, s):
    # Expand \uxxxx escapes before handing the pattern to re
    return re.search(unicode_unescape(pattern), s)

def unicode_unescape(s):
    r"""
    Turn \uxxxx escapes into actual unicode characters
    """
    def unescape_one_match(matchObj):
        # Decode only the matched escape, not the whole pattern
        escape_seq = matchObj.group(0)
        return escape_seq.decode('unicode_escape')
    return re.sub(r"\\u[0-9a-fA-F]{4}", unescape_one_match, s)

Example of it working:

>>> pat = r"C:\\.*\u20ac" # U+20ac is the euro sign
>>> print pat
C:\\.*\u20ac

>>> path = ur"C:\reports\twenty\u20acplan.txt"
>>> print path
C:\reports\twenty€plan.txt

# Underlying re.search method fails to find a match
>>> re.search(pat, path) != None
False

# Vs this:
>>> my_re_search(pat, path) != None
True

Thanks to the question "Process escape sequences in a string in Python" for pointing out the decode("unicode_escape") idea.

But note that you can't just throw your whole pattern through decode("unicode_escape"). It will work some of the time (because the codec leaves escapes it does not recognize, such as \d or \., untouched), but it won't work in general: escapes it does recognize, such as \\ or \b, get decoded before the regex engine sees them. For example, here using decode("unicode_escape") alters the meaning of the regex:

>>> pat = r"C:\\.*\u20ac" # U+20ac is the euro sign
>>> print pat # Asks for a literal backslash
C:\\.*\u20ac

>>> pat_revised = pat.decode("unicode_escape")
>>> print pat_revised # Asks for a literal period (without a backslash)
C:\.*€
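
One case the per-escape approach still has to watch for is an escaped backslash that happens to be followed by u and four hex digits: a pattern such as \\u20ac asks for a literal backslash and then the text u20ac, so it should not be expanded. A possible variant, sketched here under the assumption that only \\ and \uxxxx need special treatment, consumes escaped backslashes first and builds the character with unichr rather than decode. The name `unicode_unescape_strict` is hypothetical, and the `unichr` fallback is only there so the snippet also runs on Python 3.

```python
import re

try:
    unichr  # Python 2 builtin
except NameError:
    unichr = chr  # Assumption: on Python 3, chr() plays the same role

# Either an escaped backslash (kept as-is) or a u-escape (hex digits captured).
_ESCAPE_RE = re.compile(r"\\\\|\\u([0-9a-fA-F]{4})")

def unicode_unescape_strict(pattern):
    # Hypothetical variant: expand u-escapes, leave escaped backslashes alone.
    def replace(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            return match.group(0)  # an escaped backslash: keep unchanged
        return unichr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(replace, pattern)

pat = r"C:\\.*\u20ac"
path = u"C:\\reports\\twenty\u20acplan.txt"
found = re.search(unicode_unescape_strict(pat), path) is not None
untouched = unicode_unescape_strict(r"\\u20ac")  # stays a backslash plus "u20ac"
```

Because the alternation is tried left to right, backslashes are consumed in pairs, so a \u that follows an even number of backslashes is treated as plain text and one that follows an odd number is expanded.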
Chris