
GOAL: I have a large 1D array (3,000,000+ elements) of distances with many duplicate distances. I am trying to write the fastest function that returns all distances that appear n times in the array. I have written a function in numpy, but there is a bottleneck at one line in the code. Speed is an issue because the calculations are done in a for loop over 2400 different large distance arrays.

import numpy as np

for t in range(0, 2400):
    a = np.random.randint(1000000000, 5000000000, 3000000)
    b = np.bincount(a, minlength=np.size(a))
    c = np.where(b == 3)[0]  # SLOW STATEMENT/BOTTLENECK
    return c

EXPECTED RESULTS: Given a 1D array of distances [2000000000, 3005670000, 2000000000, 12345667, 4000789000, 12345687, 12345667, 2000000000, 12345667], I would expect back an array of [2000000000, 12345667] when queried to return all distances that appear 3 times in the main array.

What should I do?

user3152377
  • Pretty sure if you just convert the entire list into a set you'll get what you want as fast as possible. [This](https://stackoverflow.com/a/12897477/3715522) should give you a good guideline. – MCBama Dec 12 '17 at 17:23
  • Converting the list into a set would not work at all. It would remove duplicates, preventing him from counting the ones that show up N times. – Rafael Barros Dec 12 '17 at 17:28
  • Ah. I misunderstood the question. – MCBama Dec 12 '17 at 17:29
  • @MCBama thanks for the intro to sets, they may be quite useful for another separate speedup I am trying to do. – user3152377 Dec 12 '17 at 18:03
  • @RafaelBarros Very good to know it strips duplicates! – user3152377 Dec 12 '17 at 18:05
  • What's with the unconditional `return` inside the loop? That doesn't make sense. Can you correct that to what you're actually doing? – Stefan Pochmann Dec 15 '17 at 17:00

2 Answers


Use np.unique:

a = np.random.randint(0, 10**6, 3*10**6)
uniques, counts = np.unique(a, return_counts=True)

In [293]: uniques[counts==14]
Out [293]: array([  4541, 250510, 622471, 665409, 881697, 920586])

This takes less than a second, but I don't understand why your where statement is slow. For me, your solution is faster:

In [313]: %timeit b=np.bincount(a,minlength=a.size)
61.5 ms ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [314]: %timeit np.where(b==3)[0]
11.8 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [315]: %timeit uniques,counts=np.unique(a,return_counts=True)
424 ms ± 6.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [316]: %timeit Counter(a)
1.41 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

EDIT

import numba
import numpy as np

@numba.jit()
def count(a, n):
    # count occurrences with a dense counter array;
    # assumes all values in a are below 10**6
    counters = np.zeros(10**6, np.int32)
    for i in a:
        counters[i] += 1
    # collect the values whose count is exactly n
    res = np.empty_like(counters)
    k = 0
    for i, j in enumerate(counters):
        if j == n:
            res[k] = i
            k += 1
    return res[:k]

This numba function can give you a 3x improvement. For more, you must look at parallel solutions, on a GPU for example (a CPU-parallel sketch follows the timing below).

In [26]: %timeit  count(a,13)
23.6 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
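
As a rough illustration of the parallel direction mentioned above (my sketch, not part of the original answer), the histogram can be split across chunks with numba.prange; the chunk count nchunks and the 10**6 value bound maxval are assumptions carried over from the code above:

import numba
import numpy as np

@numba.njit(parallel=True)
def count_parallel(a, maxval, nchunks=8):
    # one private histogram per chunk so threads never write to the same row
    partial = np.zeros((nchunks, maxval), np.int64)
    chunk = (a.size + nchunks - 1) // nchunks
    for c in numba.prange(nchunks):
        lo = c * chunk
        hi = min(lo + chunk, a.size)
        for i in range(lo, hi):
            partial[c, a[i]] += 1
    # merge the private histograms into one
    counts = np.zeros(maxval, np.int64)
    for c in range(nchunks):
        counts += partial[c]
    return counts

# query outside the jitted function, as in the question:
# counts = count_parallel(a, 10**6)
# res = np.where(counts == 3)[0]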
B. M.
  • Yes! My solution is faster than the unique one. I just ran your solution, but when put into 2400 loops of large data sets, BOTH of our solutions' total time comes in at 3 minutes and some seconds. I was hoping to get the 2400 loops down from 3 minutes to no more than 20-30 secs. – user3152377 Dec 12 '17 at 18:00
  • 1
    @user3152377 in all fairness, you're working with a LOT of numbers. Computers are fast but the only way to really reduce the amount of time you're spending here is to get a faster one. from what I understand numpy does a pretty good job of making their stuff efficient so if you're using their code and it's taking time...well...not sure you'll find a much better solution. – MCBama Dec 12 '17 at 19:16
  • Thanks for that honest answer. Do you know if it's more RAM I should be concerned about in my particular case, or simply processor speed and number of cores? Right now I am doing this on a laptop (Mac) with 16 GB RAM, so if I need more RAM than that, the laptop would not work. Also, I heard that Linux is just plain faster for things like Python than Mac...have you heard that? – user3152377 Dec 12 '17 at 19:48
  • @B.M. This is awesome! Thank you!!! Max speed for up to 6-digit numbers; is there anything I can do to keep the speed when many of the numbers are 10 digits? Ran your code for up-to-6-digit numbers and got less than a minute on my computer, but I'm back at 4 minutes for numbers with more than 6 digits. – user3152377 Dec 12 '17 at 21:57
  • For 10-digit data, other techniques must be involved, a hash table for example (see the sketch after this comment thread). But before studying that, a statistical approach would be interesting: the histogram of the counts variable, for example. Can you give that? – B. M. Dec 13 '17 at 19:25
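
To make the hash-table idea from the last comment concrete (my illustration, not part of the answer): numba's typed dictionary can count arbitrary 64-bit values without allocating a counter slot per possible value, so it works for 10-digit distances. The function name count_large is hypothetical:

import numba
import numpy as np
from numba import types
from numba.typed import Dict

@numba.njit
def count_large(a, n):
    # hash-based counting: no dense counter array needed,
    # so 10-digit values are fine
    d = Dict.empty(types.int64, types.int64)
    for x in a:
        d[x] = d.get(x, 0) + 1
    # collect the values seen exactly n times
    res = np.empty(len(d), np.int64)
    k = 0
    for key, cnt in d.items():
        if cnt == n:
            res[k] = key
            k += 1
    return res[:k]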

You can use a Counter:

>>> from collections import Counter
>>> import numpy as np
>>> a = np.array([2000000000,3005670000,2000000000,12345667,4000789000,12345687,12345667,2000000000,12345667])
>>> c = Counter(a)
>>> np.array([i for i in c if c[i] == 3])
array([2000000000,   12345667])
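
One small aside (mine, not from the original answer): Counter hashes each element, and NumPy scalars hash more slowly than plain Python ints, so converting the array with .tolist() first typically speeds this up over the 2400 loops. The helper name distances_seen_n_times is hypothetical:

from collections import Counter
import numpy as np

def distances_seen_n_times(a, n):
    # .tolist() yields plain Python ints, which hash faster than numpy scalars
    c = Counter(a.tolist())
    return np.array([v for v, cnt in c.items() if cnt == n])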
Joe Iddon
  • Yes! This is another quick one, but when put in the 2400 loop it takes about the same time as what I have, which is 3-4 minutes. I was hoping for a solution that would take less than 30 secs total time for the 2400 loops. – user3152377 Dec 12 '17 at 18:13