0

Please consider the following code:

import re

def qcharToUnicode(s):
    p = re.compile(r"QChar\((0x[a-fA-F0-9]*)\)")
    return p.sub(lambda m: '"' + chr(int(m.group(1),16)) + '"', s)

def fixSurrogatePresence(s) :
    '''Returns the input UTF-16 string with surrogate pairs replaced by the character they represent'''
    # ideas from:
    # http://www.unicode.org/faq/utf_bom.html#utf16-4
    # http://stackoverflow.com/a/6928284/1503120
    def joinSurrogates(match) :
        SURROGATE_OFFSET = 0x10000 - ( 0xD800 << 10 ) - 0xDC00
        return chr ( ( ord(match.group(1)) << 10 ) + ord(match.group(2)) + SURROGATE_OFFSET )
    return re.sub ( '([\uD800-\uDBFF])([\uDC00-\uDFFF])', joinSurrogates, s )

Now my questions below probably reflect a C/C++ way of thinking (and not a "Pythonic" one) but I'm curious nevertheless:

I'd like to know whether the evaluation of the compiled RE object p in qcharToUnicode and of SURROGATE_OFFSET in joinSurrogates will take place at each call to the respective functions, or only once at the point of definition. In C/C++ one can declare the values as static const and the compiler will (IIUC) make the construction occur only once, but in Python we do not have any such declarations.

The question is more pertinent in the case of the compiled RE object, since it seems that the only reason to construct such an object is to avoid the repeated compilation, as the Python RE HOWTO says:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls.

... and this purpose would be defeated if the compilation were to occur at each function call. I don't want to put the symbol p (or SURROGATE_OFFSET) at module level since I want to restrict its visibility to the relevant function only.

So does the interpreter heuristically determine that the value pointed to by a particular symbol is constant (and visible within a particular function only) and hence need not be reconstructed at the next function call? Further, is this defined by the language or implementation-dependent? (I hope I'm not asking too much!)

A related question would be about the construction of the function object lambda m in qcharToUnicode -- is it also defined only once like other named function objects declared by def?

jamadagni
  • Even named functions defined by a `def` can be defined multiple times, if the entire `def` block is in a loop. In general Python makes very few assumptions about what will or will not change during the course of the program. Code is executed when it is encountered at runtime during program flow. – BrenBarn Jan 10 '14 at 07:06

3 Answers

3

The simple answer is that as written, the code will be executed repeatedly at every function call. There is no implicit caching mechanism in Python for the case you describe.
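
One nuance applies to the regex case specifically. The call to re.compile is still executed on every invocation, but the re module documents that it keeps an internal cache of recently compiled patterns, so the repeated call is usually a lookup rather than a full recompilation. The sketch below illustrates this; the names PATTERN and compile_locally are made up for the example, and the object identity it prints is a CPython implementation detail, not a language guarantee.

```python
import re

PATTERN = r"QChar\((0x[a-fA-F0-9]*)\)"

def compile_locally():
    # The call to re.compile is executed on every invocation of this function.
    return re.compile(PATTERN)

first = compile_locally()
second = compile_locally()

# Both objects represent the same pattern. Whether they are the very same
# object depends on the re module's internal cache (implementation detail).
print(first.pattern == second.pattern)
print(first is second)
```

Even with the cache, each call still pays for the function call and the cache lookup, which is what the techniques in this answer avoid.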

You should get out of the habit of talking about "declarations". A function definition is in fact also "just" a normal statement, so I can write a loop which defines the same function repeatedly:

for i in range(10):
    def f(x):
        return x*2
    y = f(i)

Here, we incur the cost of creating the function on every iteration of the loop. Timing reveals that the following version, which defines the function once, runs in about 75% of the time of the loop above:

def f(x):
    return x*2

for i in range(10):
    y = f(i)
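
A rough way to reproduce this kind of comparison is with the timeit module. This is only a sketch: the wrapper names define_inside_loop, define_outside_loop and f_outer are invented for the example, the repetition count is arbitrary, and the ratio you measure will depend on your Python version and machine. Note also that the two variants differ in name lookup (local versus global), not only in function creation.

```python
import timeit

def define_inside_loop():
    for i in range(10):
        # A new function object is created on every iteration.
        def f(x):
            return x * 2
        y = f(i)
    return y

def f_outer(x):
    return x * 2

def define_outside_loop():
    for i in range(10):
        # The function object already exists; it is only looked up and called.
        y = f_outer(i)
    return y

inside = timeit.timeit(define_inside_loop, number=20_000)
outside = timeit.timeit(define_outside_loop, number=20_000)
print(f"def inside loop:  {inside:.3f}s")
print(f"def outside loop: {outside:.3f}s")
```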

The standard way of optimising the RE case is, as you already know, to place the p variable at module scope, i.e.:

p = re.compile(r"QChar\((0x[a-fA-F0-9]*)\)")

def qcharToUnicode(s):
    return p.sub(lambda m: '"' + chr(int(m.group(1),16)) + '"', s)

You can use conventions like prepending "_" to the variable name to indicate that it is not meant to be used from outside, but normally people won't use it anyway if you haven't documented it. A trick to make the RE function-local relies on how default parameters work: default values are evaluated once, when the function definition is executed, not on each call. So you can do this:

def qcharToUnicode(s, p=re.compile(r"QChar\((0x[a-fA-F0-9]*)\)")):
    return p.sub(lambda m: '"' + chr(int(m.group(1),16)) + '"', s)

This will allow you the same optimisation but also a little more flexibility in your matching function.
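
To see that the default value really is evaluated only once, here is a small sketch that wraps re.compile in a counting helper. The helper counting_compile and the counter compile_calls are invented purely for this demonstration.

```python
import re

compile_calls = 0

def counting_compile(pattern):
    # Hypothetical wrapper that records how often compilation is requested.
    global compile_calls
    compile_calls += 1
    return re.compile(pattern)

# The default value is evaluated here, when the def statement runs.
def qchar_to_unicode(s, p=counting_compile(r"QChar\((0x[a-fA-F0-9]*)\)")):
    return p.sub(lambda m: '"' + chr(int(m.group(1), 16)) + '"', s)

results = [qchar_to_unicode("QChar(0x41)") for _ in range(3)]
print(results)        # ['"A"', '"A"', '"A"']
print(compile_calls)  # 1
```

The trade-off is that p becomes part of the function's signature, so a caller can override it, deliberately or by accident.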

Thinking properly about function definitions also allows you to stop thinking about lambda as different from def. The only difference is that def also binds the function object to a name - the underlying object created is the same.
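
A short sketch of what that means in practice, using made-up names (named, anonymous, make_callback): both forms produce ordinary function objects, and a lambda inside a function body yields a new function object on each call, although the compiled code object behind it is created only once.

```python
def named(x):
    return x * 2

anonymous = lambda x: x * 2

# Both are instances of the same built-in function type.
print(type(named) is type(anonymous))      # True
print(named.__name__, anonymous.__name__)  # named <lambda>

def make_callback():
    # The lambda expression is evaluated every time make_callback runs.
    return lambda m: m

a = make_callback()
b = make_callback()
print(a is b)                    # False: a new function object per call
print(a.__code__ is b.__code__)  # True: the compiled code object is shared
```

Creating the function object from an existing code object is cheap compared with compiling a regular expression.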

chthonicdaemon
  • 1
    Your second code snippet doesn't work. When any of those `f` functions is called, the `i` in `x*i` is evaluated using the current value of `i`, not the value from the time the function was defined. – user2357112 Jan 10 '14 at 07:10
  • @user2357112 pending verification, but I believe he'd only have that bug in the javascript equivalent. – stewSquared Jan 10 '14 at 07:20
  • @stewSquared: Javascript and Python both use function scope rather than block scope, so the problem happens in both. – user2357112 Jan 10 '14 at 07:22
  • @user2357112 We're possibly speaking of different problems. In the javascript version, all the functions returned would be equivalent to "function(x) {return x*10}" whereas in the python version, they are indeed distinct functions. – stewSquared Jan 10 '14 at 07:24
  • The idea would be to use `f` inside the loop, so it doesn't have any practical issues. This would work even if you passed `f` to another function. Unless you are in the habit of changing the loop variable in a loop, this is still useful, although it is worth remembering that the behaviour of `f` could change if `i` changes later. – chthonicdaemon Jan 10 '14 at 07:26
  • @chthonicdaemon: If the idea is to use `f` inside the loop, it's clearer to define it outside the loop. It'll work the same way, but defining it outside makes it clearer that the behavior of the function will change with each iteration. You're less likely to pass it as a callback to some other part of the codebase and tear your hair out wondering why the callback isn't working. – user2357112 Jan 10 '14 at 07:33
  • @user2357112 I understand what you are talking about, but I don't agree that defining it outside makes it clearer that it will change inside, my intuition works exactly the other way around - I expect things inside the loop to be affected by the loop variable. I have been bitten by the "changing function" problem before, although I almost never use callbacks so I've been spared most of those errors. I think I will revise my answer a little to make it clearer. In fact I agree that the answer as written is not factually accurate about the difference of the functions, so I'll take it out. – chthonicdaemon Jan 10 '14 at 07:37
1

Yes, they are re-evaluated on every call. Suppose re.compile() had a side effect. That side effect would happen every time the assignment to p was made, i.e., every time the function containing the assignment was called.

This can be verified:

def funcWithSideEffect():
    # The print call makes every evaluation visible.
    print("The airspeed velocity of an unladen swallow (european) is...")
    return 25

def funcEnclosingAssignment():
    # This assignment is executed on every call to the function.
    p = funcWithSideEffect()
    return p

a = funcEnclosingAssignment()
b = funcEnclosingAssignment()
c = funcEnclosingAssignment()

Each time the enclosing function (analogous to your qcharToUnicode) is called, the message is printed, showing that p is re-evaluated on every call.

stewSquared
1

Python is an interpreted language, so yes, the assignment is made every time you call the function. The interpreter parses your code only once, generating Python bytecode. On later calls the function is already compiled to Python VM bytecode, so that bytecode is simply executed again, including the call to re.compile.

re.compile will be called every time, as it would be in other languages. If you want to mimic a static initialization, consider using a global variable; that way it is called only once. Better still, you can create a class with static methods and class-level (not instance) members.
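
A minimal sketch of the class-based variant, assuming a hypothetical class name QCharConverter and method name to_unicode (neither appears in the question): the pattern is a class attribute, so it is built once when the class body is executed.

```python
import re

class QCharConverter:
    # Class attribute: evaluated once, when the class body is executed.
    _pattern = re.compile(r"QChar\((0x[a-fA-F0-9]*)\)")

    @staticmethod
    def to_unicode(s):
        # Reuses the already compiled pattern on every call.
        return QCharConverter._pattern.sub(
            lambda m: '"' + chr(int(m.group(1), 16)) + '"', s
        )

print(QCharConverter.to_unicode("x = QChar(0x41)"))  # x = "A"
```

This keeps the pattern out of the module namespace, at the cost of an extra class that exists only to scope one name.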

You can check all this using the dis module in Python. I copied and pasted your code into a teste.py module:

>>> import teste
>>> import dis
>>> dis.dis(teste.qcharToUnicode)
  4           0 LOAD_GLOBAL              0 (re)
              3 LOAD_ATTR                1 (compile)
              6 LOAD_CONST               1 ('QChar\\((0x[a-fA-F0-9]*)\\)')
              9 CALL_FUNCTION            1
             12 STORE_FAST               1 (p)

  5          15 LOAD_FAST                1 (p)
             18 LOAD_ATTR                2 (sub)
             21 LOAD_CONST               2 (<code object <lambda> at 0056C140, file "teste.py", line 5>)
             24 MAKE_FUNCTION            0
             27 LOAD_FAST                0 (s)
             30 CALL_FUNCTION            2
             33 RETURN_VALUE
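
Opcode names and offsets in dis output differ between Python versions, so a version-independent way to make the same check is to look at the names a function's code object refers to. The sketch below uses two illustrative functions, compile_inside and compile_outside, that are not part of the original code.

```python
import re

def compile_inside(s):
    # re.compile is looked up and called as part of the function body.
    p = re.compile(r"QChar\((0x[a-fA-F0-9]*)\)")
    return p.sub("", s)

_p = re.compile(r"QChar\((0x[a-fA-F0-9]*)\)")

def compile_outside(s):
    # Only the precompiled module-level object is referenced here.
    return _p.sub("", s)

# co_names lists the global and attribute names used by the bytecode.
print(compile_inside.__code__.co_names)
print(compile_outside.__code__.co_names)
```

The first tuple contains 'compile', which means the call is part of the function body and runs on each invocation; the second does not.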
nmenezes
  • 1
    Um it says "don't use comment for thanks" but I feel somewhat lacking in etiquette to not say thanks for all the useful replies. I've upvoted them all and accepted one. Especially this one is useful because I did not know about `dis`. – jamadagni Jan 11 '14 at 04:48