
I have a numpy array,

mylist=np.array([120,3,10,33,5,54,2,23,599,801])

and a function:

def getSum(n):
    # square n and sum the decimal digits of the square
    n = n**2
    sum = 0
    while (n != 0):
        sum = sum + int(n % 10)
        n = int(n/10)
    # keep the element if that digit sum is below 20
    if sum < 20:
        return True
    return False

I am trying to apply my function to mylist and retrieve only the elements for which it returns True.

My expected output is:

[120, 3, 10, 33, 5, 54, 2, 23, 801]

I can do that with list(filter(getSum, mylist)); how do I do the same in numpy?

I tried np.where, but it did not produce the expected output.
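For context, calling getSum directly on the whole array does not work either, because the while condition then tests a boolean array rather than a single value; a minimal sketch of the failure:

getSum(mylist)
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()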

  • Your function will not work and throws errors at the while loop. Can you explain the logic you want to implement using `numpy`? That would help in understanding the problem. – Space Impact May 13 '19 at 05:16
  • @SandeepKadapa, I found it, we need `np.vectorize`; I will add an answer. Thanks – Pyd May 13 '19 at 05:18

5 Answers


If you want to keep the elements whose squared value has a digit sum below 20, here is a pure numpy solution (here you can find how to decompose an integer into its digits):

import numpy as np


mylist = np.array([120, 3, 10, 33, 5, 54, 2, 23, 599, 801])

squares = mylist**2  # the function works on the squares, so keep the original array intact
max_digits = int(np.ceil(np.max(np.log10(squares))))  # max number of digits in squares
digits = squares // (10**np.arange(max_digits)[:, None]) % 10  # matrix of digits, one decimal place per row
digitsum = np.sum(digits, axis=0)  # array of digit sums
mask = digitsum < 20
mask
# array([ True,  True,  True,  True,  True,  True,  True,  True, False,  True])
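
Applying the mask to the original array recovers the expected elements; the digit trick can also be checked on a single value (a minimal usage sketch):

mylist[mask]
# array([120,   3,  10,  33,   5,  54,   2,  23, 801])

599**2 // np.array([1, 10, 100, 1000, 10000, 100000]) % 10
# array([1, 0, 8, 8, 5, 3])  -> the digits of 358801, which sum to 25, so 599 is masked out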

Update: speed comparison

@hpaulj made a nice time comparison among (almost) all of the proposed solutions.
The winner was filter with a pure list input, while my pure numpy solution did not perform well.
However, if we test them against a wider range of input sizes, things change.
Here is a test performed with perfplot from @NicoSchlömer.
Up to roughly 100 elements all solutions are about equivalent; beyond that, pure numpy is faster (perfplot benchmark figure).
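
A minimal sketch of how such a perfplot comparison could be set up (assuming perfplot's setup/kernels/labels/n_range interface; the kernels reuse getSum from the question and are illustrative rather than the exact benchmark behind the figure):

import numpy as np
import perfplot


def digit_sum_mask(arr):
    # pure numpy approach: digit sums of the squares, keep those below 20
    squares = arr**2
    max_digits = int(np.ceil(np.max(np.log10(squares))))
    digits = squares // (10**np.arange(max_digits)[:, None]) % 10
    return arr[np.sum(digits, axis=0) < 20]


perfplot.show(
    setup=lambda n: np.random.randint(1, 1000, n),   # random test arrays of growing size
    kernels=[
        lambda a: list(filter(getSum, a.tolist())),  # original filter approach on a plain list
        digit_sum_mask,                              # pure numpy approach
    ],
    labels=["filter", "pure numpy"],
    n_range=[10**k for k in range(1, 6)],
    xlabel="len(arr)",
    equality_check=None,  # a list and an ndarray are returned, so skip the output comparison
)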

Lante Dellarovere

Since the function needs explicit loops, it is better to use numba here:

import numpy as np
from numba import jit

@jit(nopython=True)
def get_vals(arr):
    out = np.zeros(arr.shape[0], dtype=bool)  # boolean mask of elements to keep
    for i, n in enumerate(arr):
        n = n**2
        sum1 = 0
        while (n != 0):
            sum1 = sum1 + int(n % 10)
            n = int(n/10)
        if sum1 < 20:
            out[i] = True
    return arr[out]

print(get_vals(mylist))
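
Note that the first call pays the JIT compilation cost; a rough timing sketch with the standard timeit module (numbers not reproduced here):

import timeit

get_vals(mylist)  # warm-up call triggers numba compilation
print(timeit.timeit(lambda: get_vals(mylist), number=10000))  # times only the compiled code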
jezrael

Using a list comprehension. Per the documentation, np.vectorize is essentially a for loop under the hood, so it does not improve your performance:

mylist[[getSum(i) for i in mylist]]

array([120,   3,  10,  33,   5,  54,   2,  23, 801])
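
The comprehension produces a plain list of booleans, which numpy accepts as a boolean index; an equivalent, slightly more explicit sketch:

mask = np.array([getSum(i) for i in mylist])  # explicit boolean mask, one entry per element
mylist[mask]
# array([120,   3,  10,  33,   5,  54,   2,  23, 801])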
Space Impact

The function and test array:

In [22]: def getSum(n):  
    ...:     n=n**2 
    ...:     sum = 0 
    ...:     while (n != 0):  
    ...:  
    ...:         sum = sum + int(n % 10)  
    ...:         n = int(n/10)  
    ...:     if sum <20: 
    ...:         return True 
    ...:     return False 
    ...:                                                                        
In [23]: mylist=np.array([120,3,10,33,5,54,2,23,599,801])                       

Your filter solution:

In [51]: list(filter(getSum, mylist))                                           
Out[51]: [120, 3, 10, 33, 5, 54, 2, 23, 801]

and a sample timing:

In [52]: timeit list(filter(getSum, mylist))                                    
32.8 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Since this returns a list, and iterates, it should be faster if mylist were a list rather than an array:

In [53]: %%timeit alist=mylist.tolist() 
    ...: list(filter(getSum, alist))                                                                        
18.4 µs ± 378 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

alternatives

You proposed use of np.vectorize:

In [56]: f = np.vectorize(getSum); mylist[f(mylist)]                            
Out[56]: array([120,   3,  10,  33,   5,  54,   2,  23, 801])
In [57]: timeit f = np.vectorize(getSum); mylist[f(mylist)]                     
63.4 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [58]: timeit mylist[f(mylist)]                                               
57.6 µs ± 920 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Oops! That's quite a bit slower, even if we remove the f creation from the timing loop. vectorize is convenient, but does not promise speed.
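
One small tweak worth noting (based on the documented otypes parameter of np.vectorize, not timed here): declaring the output dtype up front lets vectorize skip the extra trial call it otherwise makes to infer it, and guarantees a bool result:

f = np.vectorize(getSum, otypes=[bool])  # output dtype declared explicitly
mylist[f(mylist)]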

I've found that frompyfunc is faster than np.vectorize (though they are related):

In [59]: g = np.frompyfunc(getSum, 1,1)                                         
In [60]: g(mylist)                                                              
Out[60]: 
array([True, True, True, True, True, True, True, True, False, True],
      dtype=object)

The result is object dtype, which in this case has to be converted to bool:

In [63]: timeit mylist[g(mylist).astype(bool)]                                  
25.5 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That's better than your filter - but only if applied to the array, not the list.

@Saandeep proposed a list comprehension:

In [65]: timeit mylist[[getSum(i) for i in mylist]]                             
40.7 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That's a bit slower than your filter.

A faster way to use list comprehension is:

 [i for i in mylist if getSum(i)]

This times the same as your filter - for both the array and list versions (I lost the session where I was timing things).
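
If the final result needs to be an array rather than a list, the comprehension can simply be wrapped (a minimal sketch, not separately timed):

np.array([i for i in mylist if getSum(i)])
# array([120,   3,  10,  33,   5,  54,   2,  23, 801])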

pure numpy

@lante worked out a pure numpy solution, clever but a bit obscure. I haven't worked out the logic:

def lante(mylist):
    max_digits = np.ceil(np.max(np.log10(mylist)))  # max number of digits in mylist
    digits = mylist//(10**np.arange(max_digits)[:, None])%10  # matrix of digits
    digitsum = np.sum(digits, axis=0)  # array of sums
    mask = digitsum > 20
    return mask

And unfortunately not a speed demon:

In [69]: timeit mylist[~lante(mylist)]                                          
58.9 µs ± 757 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I don't have numba installed, so can't time @jezrael's solution.

So your original filter is a good solution, especially if you start with a list rather than an array. Especially when considering conversion times, a good Python list solution is often better than a numpy one.

Timings may be different with a large example, but I don't expect any upsets.
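
A sketch of how such a larger test could be set up, using the array mentioned in the comments below (results not reproduced here):

blist = np.random.randint(1, 1000, 10000)    # larger random test array
g = np.frompyfunc(getSum, 1, 1)
timeit list(filter(getSum, blist.tolist()))  # list-based filter
timeit blist[g(blist).astype(bool)]          # frompyfunc + boolean mask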

hpaulj
  • Nice comparison, but I suggest you try with larger samples. You'll be surprised by the upsets (not here to start a speed war). Also with a 100-number list, I guess – Lante Dellarovere May 13 '19 at 08:23
  • @LanteDellarovere, your solution does scale better - but for some reason it doesn't produce the same result for a big array like `blist=np.random.randint(1,1000,10000)` – hpaulj May 13 '19 at 15:45
  • I just noticed that the OP sums the digits of the squared elements (not the originals from `mylist`), and also tests whether they are `<20` (not `>20` as I did). Edited. Extra time for the `mylist**2` creation has to be added, but the numpy version will still be much faster. – Lante Dellarovere May 13 '19 at 15:54
  • If you are interested, I extended your time comparison. – Lante Dellarovere May 13 '19 at 18:04
vec = np.vectorize(getSum)
mylist[vec(mylist)]

Out[]:
array([120,   3,  10,  33,   5,  54,   2,  23, 801])
Pyd