
An attempt to use numpy.vectorize with a large number of input and output arguments generates an error:

import pandas as pd
import numpy as np

df = pd.DataFrame([[0] * 20], columns=
['a01', 'b02', 'c03', 'd04', 'e05', 'f06', 'g07', 'h08', 'i09', 'j10',
 'k11', 'l12', 'n13', 'n14', 'o15', 'p16', 'q17', 'r18', 's19', 't20'])


def func(a01, b02, c03, d04, e05, f06, g07, h08, i09, j10,
         k11, l12, n13, n14, o15, p16, q17, r18, s19, t20):
    # ... some complex logic here, if, for loops and so on
    return (a01, b02, c03, d04, e05, f06, g07, h08, i09, j10,
            k11, l12, n13, n14, o15, p16, q17, r18, s19, t20)


df['a21'], df['b22'], df['c23'], df['d24'], df['e25'], df['f26'], df['g27'], df['h28'], df['i29'], df['j30'], \
df['k31'], df['l32'], df['n33'], df['n34'], df['o35'], df['p36'], df['q37'], df['r38'], df['s39'], df['t40'], \
    = np.vectorize(func)(
    df['a01'], df['b02'], df['c03'], df['d04'], df['e05'], df['f06'], df['g07'], df['h08'], df['i09'], df['j10'],
    df['k11'], df['l12'], df['n13'], df['n14'], df['o15'], df['p16'], df['q17'], df['r18'], df['s19'], df['t20'])
Traceback (most recent call last):
  File "ufunc.py", line 18, in <module>
    = np.vectorize(func)(
  File "C:\Python\3.8.3\lib\site-packages\numpy\lib\function_base.py", line 2108, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "C:\Python\3.8.3\lib\site-packages\numpy\lib\function_base.py", line 2186, in _vectorize_call
    ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
  File "C:\Python\3.8.3\lib\site-packages\numpy\lib\function_base.py", line 2175, in _get_ufunc_and_otypes
    ufunc = frompyfunc(_func, len(args), nout)
ValueError: Cannot construct a ufunc with more than 32 operands (requested number were: inputs = 20 and outputs = 20)

Note: the code is a simplification of generated code. The actual number of rows would be in the millions. The column names do not have any regular structure; I chose these names to make counting easier.

Any suggestions on how to restructure the code while keeping the performance benefits of numpy.vectorize? I found that np.vectorize is much faster than "apply" or passing Series as input and output.

Thank you.

Pavel Ganelin
  • Collecting the arguments into a tuple seems like the simplest approach. You'll pay some cost for tuple-packing and unpacking, so you'll have to benchmark for your use case (a sketch of this idea follows the comments). – bnaecker Jul 25 '20 at 16:53
  • You seem to be under a misunderstanding of first principles based on "the performance benefits of numpy.vectorize". Vectorize is a glorified python-level `for` loop. It offers little benefit besides legibility. – Mad Physicist Jul 25 '20 at 16:53
  • I would first ensure this isn't an [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem); also, [np.vectorize isn't really vectorized](https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c/52674448#52674448). – anky Jul 25 '20 at 16:54
  • The X problem is: apply the function "func" to each row of the data frame and generate a new set of columns. The function "func" is inherently scalar, so the code cannot be replaced with column-level code like pd["c"] = pd["a"] + pd["b"]. – Pavel Ganelin Jul 25 '20 at 17:07
  • I looked into the implementation of "vectorize" and it looks rather complex for a glorified for loop, so I thought there should be something else to it. Is there? – Pavel Ganelin Jul 25 '20 at 17:12
  • "I found that np.vectorize is much faster than "apply" or passing Series as input and output." - where did you find this? Just reading, or experimenting with smaller problems? – hpaulj Jul 25 '20 at 18:35
  • @hpaulj Actual experimenting with code similar to the one I posted. I had a frame with 80K rows. From memory, approximate numbers: apply with Series 200 sec, apply with tuples 80 sec; vectorize was much faster, I do not remember the exact numbers. – Pavel Ganelin Jul 25 '20 at 18:45
  • When we say `np.vectorize` doesn't help, it's usually compared to a python `for` loop. `pandas` `apply` is a more complex operation, which I haven't used much. – hpaulj Jul 25 '20 at 18:56
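
A minimal sketch of the row-packing idea from the first comment, assuming all 20 outputs share a common dtype. It wraps func so that np.vectorize sees a single array-valued input and a single array-valued output (via its signature option), so the 32-operand limit never applies. The names row_func and new_cols are illustrative, not from the original code:

import numpy as np

# Hypothetical wrapper: the whole row goes in as one argument, and the 20
# results come back as one length-20 array.
row_func = np.vectorize(lambda row: np.array(func(*row)), signature='(n)->(n)')

out = row_func(df.to_numpy())          # shape (n_rows, 20)

new_cols = ['a21', 'b22', 'c23']       # ... the 20 output column names
for i, name in enumerate(new_cols):
    df[name] = out[:, i]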

1 Answer


The basic purpose of np.vectorize is to make it easy to apply the full power of numpy broadcasting to a function that only accepts scalar inputs. Thus, with a simple formatting function:

In [28]: def foo(i,j): 
    ...:     return f'{i}:{j}' 
    ...:                                                                                             
In [29]: foo(1,2)                                                                                    
Out[29]: '1:2'
In [31]: f = np.vectorize(foo, otypes=['U5'])                                                        

With vectorize I can pass lists/arrays of matching shape:

In [32]: f([1,2,3],[4,5,6])                            
Out[32]: array(['1:4', '2:5', '3:6'], dtype='<U3')

Or, with (3,1) and (3,) shaped inputs, produce a (3,3) result:

In [33]: f(np.arange(3)[:,None], np.arange(4,7))                                                     
Out[33]: 
array([['0:4', '0:5', '0:6'],
       ['1:4', '1:5', '1:6'],
       ['2:4', '2:5', '2:6']], dtype='<U3')

I haven't seen your error before, but can guess where it comes from:

ufunc = frompyfunc(_func, len(args), nout)
ValueError: Cannot construct a ufunc with more than 32 operands 
(requested number were: inputs = 20 and outputs = 20)

The actual work is done by np.frompyfunc, which, as you can see, expects two numbers: the number of arguments and the number of returned values, 20 and 20 in your case. Apparently there's a limit of 32 total (32 is also the maximum number of dimensions a numpy array can have). I've seen the same limit come up in a few other places, such as np.select. In any case, this limit is deeply embedded in numpy, so there isn't much you can do to avoid it.
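
For illustration, a minimal sketch of np.frompyfunc itself (the add function here is just my own example):

import numpy as np

def add(a, b):
    return a + b

# frompyfunc(func, nin, nout): here 2 inputs and 1 output
uadd = np.frompyfunc(add, 2, 1)
uadd([1, 2, 3], [4, 5, 6])          # -> array([5, 7, 9], dtype=object)

# Requesting 20 inputs and 20 outputs exceeds the 32-operand limit:
# np.frompyfunc(add, 20, 20)        # ValueError: Cannot construct a ufunc ...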

You haven't told us about the "complex logic", but apparently it takes a whole row of the dataframe and returns an equivalent-size row.

Let's try applying another function to a dataframe:

In [41]: df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['a','b','c','d'])                     
In [42]: df                                                                                          
Out[42]: 
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

In [44]: def foo(a,b,c,d): 
    ...:     print(a,b,c,d) 
    ...:     return 2*a, str(b), c*d, c/d 
    ...:                                                                                             
In [45]: foo(1,2,3,4)                                                                                
1 2 3 4
Out[45]: (2, '2', 12, 0.75)

In [47]: f = np.vectorize(foo)                                                                       
In [48]: f(df['a'],df['b'],df['c'],df['d'])                                                          
0 1 2 3                                 # a trial run to determine return type
0 1 2 3
4 5 6 7
8 9 10 11
Out[48]: 
(array([ 0,  8, 16]),
 array(['1', '5', '9'], dtype='<U1'),
 array([  6,  42, 110]),
 array([0.66666667, 0.85714286, 0.90909091]))

vectorize returns a tuple of arrays, one for each returned value.
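
That tuple maps naturally onto new dataframe columns; a sketch (the new column names here are illustrative):

out = f(df['a'], df['b'], df['c'], df['d'])   # tuple of 4 arrays
for name, col in zip(['a2', 'b2', 'c2', 'd2'], out):
    df[name] = col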

Using pandas apply with the same function:

In [80]: df.apply(lambda x:foo(*x),1)                                                                
0 1 2 3
4 5 6 7
8 9 10 11
Out[80]: 
0       (0, 1, 6, 0.6666666666666666)
1      (8, 5, 42, 0.8571428571428571)
2    (16, 9, 110, 0.9090909090909091)
dtype: object
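
If you do go the apply route, that object Series of tuples can be expanded into one column per returned value; a sketch, using either tolist() or apply's result_type='expand' option:

out = pd.DataFrame(df.apply(lambda x: foo(*x), 1).tolist(), index=df.index)

# or, equivalently, let apply do the expansion
out = df.apply(lambda x: foo(*x), axis=1, result_type='expand')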

A simple row iteration:

In [76]: for i in range(3): 
    ...:     print(foo(*df.iloc[i])) 
    ...:                                                                                             
0 1 2 3
(0, '1', 6, 0.6666666666666666)
4 5 6 7
(8, '5', 42, 0.8571428571428571)
8 9 10 11
(16, '9', 110, 0.9090909090909091)

Timings

Simplify foo (drop the print) for timing:

In [92]: def foo1(a,b,c,d): 
    ...:     return 2*a, str(b), c*d, c/d 
    ...:                                                                                             
In [93]: f = np.vectorize(foo1)                                                                      

Let's also test application to rows of an array:

In [97]: arr = df.to_numpy()                                                                         
In [99]: [foo1(*row) for row in arr]                                                                 
Out[99]: 
[(0, '1', 6, 0.6666666666666666),
 (8, '5', 42, 0.8571428571428571),
 (16, '9', 110, 0.9090909090909091)]

vectorize is noticeably faster than apply:

In [100]: timeit f(df['a'],df['b'],df['c'],df['d'])                                                  
237 µs ± 3.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: timeit df.apply(lambda x:foo1(*x),1)                                                       
1.04 ms ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

It's even faster than a more direct iteration on rows of the dataframe:

In [102]: timeit [foo1(*df.iloc[i]) for i in range(3)]                                               
528 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But applying foo1 to the rows of the array is faster still:

In [103]: timeit [foo1(*row) for row in arr]                                                         
17.5 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [105]: timeit f(*arr.T)                                                                           
75.1 µs ± 81.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

These last two show that np.vectorize is slow relative to direct iteration on an array. Iterating on a dataframe, in its various forms, adds even more computation time.
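
Put back in terms of the original 20-input/20-output problem, a sketch of the fastest pattern above (in_cols and out_cols are illustrative placeholders for the question's column lists):

in_cols = ['a01', 'b02', 'c03']    # ... the 20 input column names
out_cols = ['a21', 'b22', 'c23']   # ... the 20 output column names

arr = df[in_cols].to_numpy()
results = [func(*row) for row in arr]              # one 20-tuple per row
for name, values in zip(out_cols, zip(*results)):  # transpose to per-output columns
    df[name] = list(values)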

hpaulj