Apply a custom function to a numpy array

Question

I have a numpy array:

mylist=np.array([120,3,10,33,5,54,2,23,599,801])

and a function:

def getSum(n): 
    n=n**2
    sum = 0
    while (n != 0): 

        sum = sum + int(n % 10) 
        n = int(n/10) 
    if sum <20:
        return True
    return False

I am trying to apply my function to mylist and retrieve only those elements for which it returns True.

My expected output is:

[120, 3, 10, 33, 5, 54, 2, 23, 801]

I can do that with list(filter(getSum, mylist)), but how do I do the same in numpy?

I tried np.where, but it does not produce the expected output.

Your function will not work and throws an error at the while loop. Can you explain exactly what logic you want to implement using `numpy`? That would help to understand the problem. — Space Impact, May 13 '19 at 05:16
@SandeepKadapa, I found it: we need `np.vectorize`. I will add an answer. Thanks — Pyd, May 13 '19 at 05:18

Lante Dellarovere · Answer 1 · 2019-05-13T18:20:53.727

If you want to check whether the digit sums of the squared elements are < 20, here is a pure numpy solution (it decomposes each integer into its digits):

import numpy as np


mylist=np.array([120,3,10,33,5,54,2,23,599,801])

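squares = mylist**2  # keep mylist intact so the mask can index it later
max_digits = int(np.floor(np.log10(squares.max()))) + 1  # digits in the largest square (also correct for exact powers of 10)
digits = squares // (10**np.arange(max_digits)[:, None]) % 10  # matrix of digits, one column per element
digitsum = np.sum(digits, axis=0)  # array of digit sums
mask = digitsum < 20
mask
# array([ True,  True,  True,  True,  True,  True,  True,  True, False,  True])
mylist[mask]
# array([120,   3,  10,  33,   5,  54,   2,  23, 801])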
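To make the digit matrix less obscure, here is a small worked sketch of the broadcasting step on three hand-picked squares. The names `values` and `powers` and the fixed width of 5 digits are illustrative assumptions, not part of the answer above.

```python
import numpy as np

values = np.array([14400, 9, 1089], dtype=np.int64)  # squares of 120, 3, 33

# Powers of ten as a column vector: shape (5, 1), one row per digit position.
powers = 10 ** np.arange(5, dtype=np.int64)[:, None]

# Broadcasting (5, 1) against (3,) gives a (5, 3) matrix:
# row k holds the digit at position 10**k of every value.
digits = values // powers % 10
print(digits)
# [[0 9 9]
#  [0 0 8]
#  [4 0 0]
#  [4 0 1]
#  [1 0 0]]

print(digits.sum(axis=0))  # digit sum per value
# [ 9  9 18]
```

Each column read from top to bottom is one number written least significant digit first, so summing along axis 0 yields the digit sums in a single vectorised step.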

Update: speed comparison

@hpaulj did a nice timing comparison among (almost) all proposed solutions.
The winner was filter with a pure list input, while my pure numpy solution did not perform well.
However, if we test them against a wider range of input sizes, things change.
I ran a test with perfplot from @NicoSchlömer.
For inputs of 100+ elements all the other solutions perform about the same, while pure numpy is faster.
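A comparison along these lines can be roughly reproduced with the standard library's timeit instead of perfplot. This is only a sketch: `get_sum` and `numpy_mask` are illustrative rewrites of the functions in this thread, the input sizes are arbitrary, and the absolute numbers will depend on the machine.

```python
import timeit
import numpy as np

def get_sum(n):
    """Return True if the digit sum of n**2 is below 20 (plain Python)."""
    n = int(n) ** 2
    total = 0
    while n:
        total += n % 10
        n //= 10
    return total < 20

def numpy_mask(arr):
    """Vectorised version of the same test, assuming positive integers."""
    squares = arr.astype(np.int64) ** 2
    n_digits = len(str(int(squares.max())))  # digits in the largest square
    digits = squares // (10 ** np.arange(n_digits, dtype=np.int64)[:, None]) % 10
    return digits.sum(axis=0) < 20

rng = np.random.default_rng(0)
for size in (10, 100, 1000, 10000):
    arr = rng.integers(1, 1000, size=size)
    as_list = arr.tolist()
    # Both approaches must select the same elements before timing them.
    assert arr[numpy_mask(arr)].tolist() == list(filter(get_sum, as_list))
    t_filter = timeit.timeit(lambda: list(filter(get_sum, as_list)), number=20)
    t_numpy = timeit.timeit(lambda: arr[numpy_mask(arr)], number=20)
    print(f"n={size:>6}  filter: {t_filter / 20:.2e} s  numpy: {t_numpy / 20:.2e} s")
```

The filter is timed on a plain list here, which is its most favourable case according to the timings reported elsewhere in this thread.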

score 1 · Answer 2 · answered May 13 '19 at 05:25

Since the function relies on loops, I think it is better to use numba here:

import numpy as np
from numba import jit

@jit(nopython=True)
def get_vals(arr):
    out = np.zeros(arr.shape[0], dtype=np.bool_)  # boolean mask, initially all False
    for i, n in enumerate(arr):
        n = n**2
        sum1 = 0
        while n != 0:  # add up the decimal digits of n**2
            sum1 += n % 10
            n //= 10
        if sum1 < 20:
            out[i] = True
    return arr[out]  # keep only the elements whose mask entry is True

print(get_vals(mylist))

jezrael


@jezrael, thank you. What do you think about my np.vectorize answer? Which one is more efficient? – Pyd May 13 '19 at 05:27
@pyd - I think `np.vectorize` should be slower; it is best to test on real data. – jezrael May 13 '19 at 05:28

score 1 · Answer 3 · answered May 13 '19 at 05:26

You can use a list comprehension. According to the documentation, np.vectorize is essentially a for loop under the hood, so it does not improve performance either:

mylist[[getSum(i) for i in mylist]]

array([120,   3,  10,  33,   5,  54,   2,  23, 801])
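If the Python-level loop is kept anyway, one possible variation (not part of the answer above) is to build the boolean mask with np.fromiter, which avoids the intermediate list. The helper `get_sum` below is an illustrative rewrite of the question's getSum using integer arithmetic.

```python
import numpy as np

def get_sum(n):
    # Same test as getSum in the question, written with integer arithmetic.
    n = int(n) ** 2
    total = 0
    while n:
        total += n % 10
        n //= 10
    return total < 20

mylist = np.array([120, 3, 10, 33, 5, 54, 2, 23, 599, 801])

# Build the boolean mask straight from a generator; count lets numpy
# preallocate the output instead of growing it.
mask = np.fromiter((get_sum(i) for i in mylist), dtype=bool, count=len(mylist))
print(mylist[mask])
# [120   3  10  33   5  54   2  23 801]
```

This still calls the Python function once per element, so it should not be expected to beat the list comprehension by much.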

Space Impact


hpaulj · Answer 4 · 2019-05-13T07:55:40.473

The function and test array:

In [22]: def getSum(n):  
    ...:     n=n**2 
    ...:     sum = 0 
    ...:     while (n != 0):  
    ...:  
    ...:         sum = sum + int(n % 10)  
    ...:         n = int(n/10)  
    ...:     if sum <20: 
    ...:         return True 
    ...:     return False 
    ...:                                                                        
In [23]: mylist=np.array([120,3,10,33,5,54,2,23,599,801])

Your filter solution:

In [51]: list(filter(getSum, mylist))                                           
Out[51]: [120, 3, 10, 33, 5, 54, 2, 23, 801]

and a sample timing:

In [52]: timeit list(filter(getSum, mylist))                                    
32.8 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Since this returns a list, and iterates, it should be faster if mylist was a list, rather than an array:

In [53]: %%timeit alist=mylist.tolist() 
    ...: list(filter(getSum, alist))                                                                        
18.4 µs ± 378 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

alternatives

You proposed use of np.vectorize:

In [56]: f = np.vectorize(getSum); mylist[f(mylist)]                            
Out[56]: array([120,   3,  10,  33,   5,  54,   2,  23, 801])
In [57]: timeit f = np.vectorize(getSum); mylist[f(mylist)]                     
63.4 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [58]: timeit mylist[f(mylist)]                                               
57.6 µs ± 920 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Oops! that's quite a bit slower, even if we remove the f creation from the timing loop. vectorize is pretty, but does not promise speed.
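Part of the overhead is that np.vectorize, when otypes is not given, makes a trial call on the first element to infer the output type. The sketch below passes otypes explicitly and records the calls in a list; `get_sum` and `call_log` are illustrative names, and whether this changes the timings noticeably is untested here.

```python
import numpy as np

call_log = []  # records every value the function is called with

def get_sum(n):
    call_log.append(int(n))
    n = int(n) ** 2
    total = 0
    while n:
        total += n % 10
        n //= 10
    return total < 20

mylist = np.array([120, 3, 10, 33, 5, 54, 2, 23, 599, 801])

# Declaring the output type up front means vectorize does not need a
# trial call on the first element to infer it.
f = np.vectorize(get_sum, otypes=[bool])
mask = f(mylist)

print(mylist[mask])
print(len(call_log), "calls for", len(mylist), "elements")
```

Either way the function body still runs in the Python interpreter for every element, which is why vectorize does not approach the speed of true array operations.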

I've found that frompyfunc is faster than np.vectorize (though they are related):

In [59]: g = np.frompyfunc(getSum, 1,1)                                         
In [60]: g(mylist)                                                              
Out[60]: 
array([True, True, True, True, True, True, True, True, False, True],
      dtype=object)

The result has object dtype, which in this case has to be converted to bool:

In [63]: timeit mylist[g(mylist).astype(bool)]                                  
25.5 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That's better than your filter - but only if applied to the array, not the list.

@Sandeep proposed a list comprehension:

In [65]: timeit mylist[[getSum(i) for i in mylist]]                             
40.7 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That's a bit slower than your filter.

A faster way to use list comprehension is:

 [i for i in mylist if getSum(i)]

This times the same as your filter - for both the array and list versions (I lost the session where I was timing things).

pure numpy

@lante worked out a pure numpy solution, clever but a bit obscure. I haven't worked out the logic:

def lante(mylist):
    max_digits = np.ceil(np.max(np.log10(mylist)))  # max number of digits in mylist
    digits = mylist//(10**np.arange(max_digits)[:, None])%10  # matrix of digits
    digitsum = np.sum(digits, axis=0)  # array of sums
    mask = digitsum > 20
    return mask

And unfortunately not a speed demon:

In [69]: timeit mylist[~lante(mylist)]                                          
58.9 µs ± 757 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I don't have numba installed, so can't time @jezrael's solution.

So your original filter is a good solution, especially if you start with a list rather than an array. Especially when conversion times are taken into account, a good Python list solution is often better than a numpy one.

Timings may be different with a large example, but I don't expect any upsets.

Nice comparison, but I suggest you try with larger samples. You'll be surprised by the upsets (not here to start a speed war), even with a 100-number list, I guess. — Lante Dellarovere, May 13 '19 at 08:23
@LanteDellarovere, your solution does scale better, but for some reason it doesn't produce the same result for a big array like `blist=np.random.randint(1,1000,10000)`. — hpaulj, May 13 '19 at 15:45
I just noticed that the OP sums the digits of the squared elements (not of the original elements of `mylist`), and also tests whether the sum is `<20` (not `>20` as I did). Edited. Extra time for creating `mylist**2` has to be added, but the numpy version will still be much faster. — Lante Dellarovere, May 13 '19 at 15:54

score 0 · Answer 5 · answered May 13 '19 at 05:19

vec=np.vectorize(getSum)
mylist[vec(mylist)]
out[]:
array([120,   3,  10,  33,   5,  54,   2,  23, 801])

Pyd

