Please consider the following code:
import re

def qcharToUnicode(s):
    p = re.compile(r"QChar\((0x[a-fA-F0-9]*)\)")
    return p.sub(lambda m: '"' + chr(int(m.group(1),16)) + '"', s)

def fixSurrogatePresence(s) :
    '''Returns the input UTF-16 string with surrogate pairs replaced by the character they represent'''
    # ideas from:
    # http://www.unicode.org/faq/utf_bom.html#utf16-4
    # http://stackoverflow.com/a/6928284/1503120
    def joinSurrogates(match) :
        SURROGATE_OFFSET = 0x10000 - ( 0xD800 << 10 ) - 0xDC00
        return chr ( ( ord(match.group(1)) << 10 ) + ord(match.group(2)) + SURROGATE_OFFSET )
    return re.sub ( '([\uD800-\uDBFF])([\uDC00-\uDFFF])', joinSurrogates, s )
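As a sanity check of the arithmetic in joinSurrogates, here is a small worked example. It is only a sketch: join_pair is a hypothetical standalone copy of the inner function, and the pair U+D83D U+DE00 is just one sample I picked, which should combine to U+1F600.

```python
# Same constant as in joinSurrogates above.
SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00

def join_pair(high, low):
    """Hypothetical standalone version of joinSurrogates, taking the two halves directly."""
    return chr((ord(high) << 10) + ord(low) + SURROGATE_OFFSET)

# (0xD83D << 10) + 0xDE00 + SURROGATE_OFFSET == 0x1F600
joined = join_pair('\ud83d', '\ude00')
print(hex(ord(joined)))  # 0x1f600
```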
Now my questions below probably reflect a C/C++ way of thinking (and not a "Pythonic" one) but I'm curious nevertheless:
I'd like to know whether the evaluation of the compiled RE object p in qcharToUnicode and of SURROGATE_OFFSET in joinSurrogates will take place at each call to the respective functions or only once, at the point of definition. In C/C++ one can declare such values as static const and the compiler will (IIUC) make the construction occur only once, but in Python we do not have any such declarations.
The question is more pertinent in the case of the compiled RE object, since it seems that the only reason to construct such an object is to avoid the repeated compilation, as the Python RE HOWTO says:
Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls.
... and this purpose would be defeated if the compilation were to occur at each function call. I don't want to put the symbol p (or SURROGATE_OFFSET) at module level since I want to restrict its visibility to the relevant function only.
So does the interpreter do something like heuristically determine that the value bound to a particular symbol is constant (and visible within a particular function only) and hence need not be reconstructed at the next function call? Further, is this behaviour defined by the language or is it implementation-dependent? (I hope I'm not asking too much!)
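One way to probe this empirically, as a minimal sketch: route the call to re.compile through a counting wrapper and call a variant of qcharToUnicode a few times. The names counting_compile, compile_calls and qchar_to_unicode are hypothetical, introduced only for this experiment. As far as I understand, the statements in a function body run on every call, so the counter should go up each time; whatever is saved comes from the cache of recently compiled patterns that the re module documents keeping internally.

```python
import re

PATTERN = r"QChar\((0x[a-fA-F0-9]*)\)"
compile_calls = 0  # how many times the function body asked for a compiled pattern

def counting_compile(pattern):
    """Hypothetical wrapper: count each request, then defer to re.compile."""
    global compile_calls
    compile_calls += 1
    return re.compile(pattern)

def qchar_to_unicode(s):
    # This assignment is an ordinary statement, so it executes on every call.
    p = counting_compile(PATTERN)
    return p.sub(lambda m: '"' + chr(int(m.group(1), 16)) + '"', s)

for _ in range(3):
    result = qchar_to_unicode("QChar(0x41)")

print(compile_calls)  # 3
print(result)         # "A"

# Whether repeated re.compile calls hand back the very same object depends on
# the re module's internal cache, so this is printed rather than relied upon.
print(re.compile(PATTERN) is re.compile(PATTERN))
```

So the call itself is repeated each time; the open part of the question is how much of the actual compilation work the cache avoids.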
A related question would be about the construction of the function object lambda m in qcharToUnicode -- is it also defined only once like other named function objects declared by def?
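A quick experiment for the lambda case, again only a sketch with a hypothetical helper make_callback: evaluate the same lambda expression twice and compare the results. My expectation is that each evaluation yields a new function object, while the compiled code behind it is shared; the latter is what I observe on CPython and I am assuming it is an implementation detail rather than a language guarantee.

```python
def make_callback():
    """Hypothetical helper: evaluates the lambda expression once per call."""
    return lambda m: m

first = make_callback()
second = make_callback()

# Two evaluations of the lambda expression give two distinct function objects...
distinct_objects = first is not second
# ...but both wrap the same code object, which was compiled only once.
shared_code = first.__code__ is second.__code__

print(distinct_objects, shared_code)
```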