
Consider the following two implementations of the same piece of code. I would have thought they were identical, but they are not.

Is this a Python/Numpy bug or a subtle gotcha? If the latter, what rule would make it obvious why it does not work as expected?

I was working with multiple arrays of data and having to process each array item by item, with each array manipulated by a table depending on its metadata.

In the real-world example 'n' is several factors and offsets, but the following code still demonstrates the issue: I was getting the wrong result in all but one case.

import numpy as np

# Change the following line to True to show different behaviour
NEEDS_BUGS = False  # Changeme

# Create some data
data = np.linspace(0, 1, 10)
print(data)

# Create a dict of vectorized functions, each of which does a different operation on a set of data
vfuncd = dict()

# Two implementations
if NEEDS_BUGS:
    # Let's do this in a loop because we like loops - however WARNING: this does not work!!
    for n in range(10):
        vfuncd[n] = np.vectorize(lambda x: x * n)
else:
    # Unwrap the loop - NOTE: Spoiler - this works
    vfuncd[0] = np.vectorize(lambda x: x * 0)
    vfuncd[1] = np.vectorize(lambda x: x * 1)
    vfuncd[2] = np.vectorize(lambda x: x * 2)
    vfuncd[3] = np.vectorize(lambda x: x * 3)
    vfuncd[4] = np.vectorize(lambda x: x * 4)
    vfuncd[5] = np.vectorize(lambda x: x * 5)
    vfuncd[6] = np.vectorize(lambda x: x * 6)
    vfuncd[7] = np.vectorize(lambda x: x * 7)
    vfuncd[8] = np.vectorize(lambda x: x * 8)
    vfuncd[9] = np.vectorize(lambda x: x * 9)

# Prove we have multiple different vectorised functions
for k, vfunc in vfuncd.items():
    print(k, vfunc)

# Do the work
res = {k: vfuncd[k](data) for k in vfuncd.keys()}

# Show the result
for k, r in res.items():
    print(k, r)
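
In case it helps, the symptom is that with NEEDS_BUGS = True every entry of vfuncd ends up multiplying by 9 (the final value of the loop variable), so only the k == 9 result matches the unrolled version. The first printed result, for example, comes out roughly as:

# NEEDS_BUGS = True  -> 0 [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# NEEDS_BUGS = False -> 0 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]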


Jay M
  • `np.vectorize` has several gotchas. I don't recommend using it. – hpaulj Jul 01 '20 at 15:26
  • I don't think this is the fault of `np.vectorize` but rather this: https://stackoverflow.com/questions/2295290/what-do-lambda-function-closures-capture. Not that I'm advocating you using np.vectorize... – Dan Jul 01 '20 at 15:27
  • Does this answer your question? [What do (lambda) function closures capture?](https://stackoverflow.com/questions/2295290/what-do-lambda-function-closures-capture) – Dan Jul 01 '20 at 15:28
  • I should have added that I know how to fix this (and already have), but I wonder why its behaviour is so obscure and whether that's intentional. – Jay M Jul 01 '20 at 15:44
  • It is intentional, albeit quite weird to me - and you didn't really fix it ;) What if the range was variable too? You can still use a loop with a few changes. But I think if you read the answers to my suggested duplicate, you'll see why this is happening (i.e. a gotcha not a bug) – Dan Jul 01 '20 at 15:47
  • I did not post the fixed code. – Jay M Jul 01 '20 at 16:10

3 Answers


I don't know exactly what you're trying to achieve or whether it's a bad idea (in terms of np.vectorize), but the issue you're facing is down to the way Python makes closures. Quoting from an answer to the linked question:

Scoping in Python is lexical. A closure will always remember the name and scope of the variable, not the object it's pointing to. Since all the functions in your example are created in the same scope and use the same variable name, they always refer to the same variable.

In other words, when you make that closure over n, you're not actually closing over the value of n, just the name. So when n changes, the value your closure sees also changes. This is quite unexpected to me, but others find it natural.
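
You can see this without numpy at all. A minimal sketch of the late-binding behaviour (mine, not from the question):

n = 5
f = lambda x: x * n   # f remembers the *name* n, not the value 5
n = 100
print(f(2))           # 200, not 10 -- n is looked up when f is called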

Here is one fix using partial:

from functools import partial
.
.
.

def func(x, n):
    return x * n

for n in range(10):
    vfuncd[n] = np.vectorize(partial(func, n=n))

Or another using a factory function:

def func_factory(n):
    return lambda x: x * n

for n in range(10):
    vfuncd[n] = np.vectorize(func_factory(n))
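
A third idiom that should also work here (not from the original post, just the usual default-argument trick) is to bind the current value of n as a default argument, since defaults are evaluated when the lambda is defined:

for n in range(10):
    # n=n freezes the current value of n at definition time
    vfuncd[n] = np.vectorize(lambda x, n=n: x * n)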
Dan
  • Man, closure came to my mind but I fell into the Python gotcha when it comes to closures. Thanks for the pointer. It explains it all. It also indicates that we should convert the value of the variable by using np.scalar() even if the value of n does not change. – Tarik Jul 01 '20 at 15:45

It seems that the Python variable n is only bound by name inside the vectorized expression:

for n in range(10):
    vfuncd[n] = np.vectorize(lambda x: x * n)

This fixes it as it creates a new object with which to bind:

for n in range(10):
    vfuncd[n] = np.vectorize(lambda x: x * np.scalar(n))

In fact this has performance implications too, since I assume the value of the Python variable has to be fetched repeatedly.

Tarik
In [13]: data = np.linspace(0,1,11)                                                     

Since the data array can be multiplied with a simple:

In [14]: data*3                                                                         
Out[14]: array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3. ])

we don't need the complication of np.vectorize to see the closure issue. A simple lambda is enough.

In [15]: vfuncd = {} 
    ...: for n in range(3): 
    ...:     vfuncd[n] = lambda x:x*n 
    ...:                                                                                
In [16]: vfuncd                                                                         
Out[16]: 
{0: <function __main__.<lambda>(x)>,
 1: <function __main__.<lambda>(x)>,
 2: <function __main__.<lambda>(x)>}
In [17]: {k:v(data) for k,v in vfuncd.items()}                                          
Out[17]: 
{0: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]),
 1: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]),
 2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}

We won't get the closure problem if we use a proper numpy "vectorization":

In [18]: data * np.arange(3)[:,None]                                                    
Out[18]: 
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
       [0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]])

Or a simple iteration if we need a dictionary:

In [20]: {k:data*k for k in range(3)}                                                   
Out[20]: 
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 1: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}

np.vectorize comes with a speed disclaimer in its docs. But it is justified where the function only takes scalar inputs and we want the flexibility of numpy broadcasting - i.e. for 2 or more arguments.

Creating multiple vectorize objects is clearly an 'anti-pattern'. I'd rather see one vectorize with the appropriate arguments:

In [25]: f = np.vectorize(lambda x,n: x*n)                                              
In [26]: {n: f(data,n) for n in range(3)}                                               
Out[26]: 
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 1: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}

That f can also produce the array Out[18] (but is slower):

In [27]: f(data, np.arange(3)[:,None])                                                  
Out[27]: 
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
       [0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]])
hpaulj
  • Thanks for the long answer, but the example given is just that, the real code is far more complicated and really does need vectorize (it's taking data in 32 bit binary floating point and converting it to floating point, while checking extra fields and applying corrections) – Jay M Jul 01 '20 at 16:50
  • That may be, but keep in mind that `np.vectorize` is not true `numpy` vectorization. It's just a prettier way of iterating (with some gotchas). You can make something just as pretty with a function definition or two. – hpaulj Jul 01 '20 at 16:59
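
For the record, a minimal sketch of what "a function definition or two" might look like, assuming the real per-array correction boils down to a factor and an offset (make_correction, factor and offset are hypothetical names, not from the real code; np and data are as in the question):

def make_correction(factor, offset):
    # Binds factor and offset when the corrector is created,
    # then works on whole arrays at once via ordinary numpy broadcasting
    def correct(arr):
        return np.asarray(arr, dtype=np.float64) * factor + offset
    return correct

correctors = {k: make_correction(factor=k, offset=0.0) for k in range(10)}
res = {k: correct(data) for k, correct in correctors.items()}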