
I have a function which takes as input 2 arrays of zeros and ones, ~8000 elements per array. My function eps calculates a statistic on these arrays and returns the output. It only involves easy operations: checking for 0 and noting the index where a 0 is found in the array. I tried my best to optimize for speed, but the best I could get is 4.5~5 seconds (for 18k array pairs) measured with the timeit library. Time is important, as I need to run this function on billions of array pairs.

#e.g. inputs
#ts_1 = [0,1,1,0,0,1,1,0,......]
#ts_2 = [1,1,1,1,1,1,1,0,......]
# tau = any integer or float

import math
from itertools import product

import numpy as np


def eps(ts_1, ts_2, tau):
    
    n1 = 0
    n2 = 0
    Q_tau = 0
    q_tau = 0

    event_index1 = [index for index, item in enumerate(ts_1) if item == 0]
    n1 = ts_1.count(0)
    event_index2 = [index for index, item in enumerate(ts_2) if item == 0]
    n2 = ts_2.count(0)


    # tried numpy based on @Ram comment below, no improvement
    event_index1, = np.where(np.array(ts_1) == 0)
    n1 = event_index1.shape[0]
    
    event_index2, = np.where(np.array(ts_2) == 0)
    n2 = event_index2.shape[0]
    # tried numpy based on @Ram comment below, no improvement   

    if (n1 == 0 or n2 == 0):
        Q_tau = 0
    else:
        c_ij = 0  
        matching_idx = set(event_index1).intersection(event_index2)
        c_ij = c_ij + (0.5 *len(matching_idx) )
        
        for x,y in product(event_index1,event_index2):
            if x-y > 0 and (x-y)<= tau:
                c_ij = c_ij +1  
        
        c_ji = 0            
        matching_idx_2 = set(event_index2).intersection(event_index1)         
        c_ji = c_ji + (0.5 *len(matching_idx_2) )
        
        for x,y in product(event_index2,event_index1):
            if x-y > 0 and (x-y)<= tau:
                c_ji = c_ji +1                       
                  
        Q_tau = (c_ij+c_ji)/math.sqrt( n1 * n2 )
        q_tau = (c_ij - c_ji)/math.sqrt( n1 * n2 )
    
    return Q_tau, q_tau
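
Roughly, this is the kind of harness used for timing (the random pairs below are just placeholders for the real data, and the pair count is scaled down for illustration):

import random
import timeit

# placeholder data: random 0/1 lists standing in for the real array pairs
pairs = [([random.randint(0, 1) for _ in range(8000)],
          [random.randint(0, 1) for _ in range(8000)])
         for _ in range(100)]

def run_all():
    for ts_1, ts_2 in pairs:
        eps(ts_1, ts_2, tau=5)

print(timeit.timeit(run_all, number=1))
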
skynaive
  • I think it's better to use ```numpy``` if you want a faster way. Please go through this answer - https://stackoverflow.com/a/18079151/2773206 – Ram Aug 16 '21 at 15:05
  • First you could replace `n1 = ts_1.count(0)` with `n1 = len(event_index1)` – Tranbi Aug 16 '21 at 15:09
  • Second, what is the purpose of `matching_idx_2 = set(event_index2).intersection(event_index1)`? it's the same as `matching_idx` – Tranbi Aug 16 '21 at 15:14
  • @Ram Thanks for the suggestion. I used a numpy array, but there was no improvement; instead, the time increased to 5.8 secs – skynaive Aug 16 '21 at 19:46
  • This question is more suited for https://codereview.stackexchange.com – smac89 Aug 16 '21 at 19:55
  • @smac89 Thank you, I posted it there https://codereview.stackexchange.com/questions/266108/fastest-way-to-iterate-2-arrays-and-do-some-operations – skynaive Aug 16 '21 at 20:16
  • I'm voting to close this post as "too broad" because ["_open-ended questions diminish the usefulness of our site and push other questions off the front page._"](https://stackoverflow.com/help/dont-ask) – Sᴀᴍ Onᴇᴌᴀ Aug 16 '21 at 22:13

2 Answers


Based on my comments earlier, and considering that permuting two lists in a product will give you the same tuples just inverted, you could reduce your code to:

import math
from itertools import product


def eps(ts_1, ts_2, tau):
    Q_tau = 0
    q_tau = 0

    event_index1 = [index for index, item in enumerate(ts_1) if item == 0]
    n1 = len(event_index1)
    event_index2 = [index for index, item in enumerate(ts_2) if item == 0]
    n2 = len(event_index2)
    
    if (n1 != 0 and n2 != 0):
        matching_idx = set(event_index1).intersection(event_index2)
        c_ij = c_ji = 0.5 *len(matching_idx)
        
        for x,y in product(event_index1,event_index2):
            if x-y > 0 and (x-y)<= tau:
                c_ij += 1
            elif y-x > 0 and (y-x) <= tau:
                c_ji += 1                     
                  
        Q_tau = (c_ij+c_ji)/math.sqrt( n1 * n2 )
        q_tau = (c_ij - c_ji)/math.sqrt( n1 * n2 )
    
    return Q_tau, q_tau
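
For example, on a tiny made-up pair (values chosen arbitrarily) the reduced version can be called like this:

ts_1 = [0, 1, 1, 0, 0, 1, 1, 0]
ts_2 = [1, 0, 1, 1, 1, 1, 1, 0]
Q_tau, q_tau = eps(ts_1, ts_2, tau=3)
print(Q_tau, q_tau)
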
Tranbi
  • Thanks for the cleanup, but the timing is almost the same at 4.7~4.9 seconds for around 18k array pairs. That amounts to ~0.0002 seconds per array pair. I am not sure if I can improve on this, as it is a bottleneck in a computation that I need to perform a billion times – skynaive Aug 16 '21 at 19:25
  • Could you share a sample of ts1 and ts2, as well as a typical value for tau? You can update your question with the details. – Tranbi Aug 16 '21 at 20:38
  • I added it. Done :) – skynaive Aug 16 '21 at 20:46

First, is there any reason to treat the arrays as a pair in your code? They could be worked on separately. But in your code you then check for matching indices... you do not mention this in your initial description. Do the arrays have a fixed size, or can that vary?

I would suggest using numpy indexing with boolean arrays to quickly get your 0 positions:

import time
import numpy as np

# random boolean data standing in for the 0/1 arrays (True = 1, False = 0)
a = np.random.uniform(0, 1, 8000) > 0.5
b = np.random.uniform(0, 1, 8000) > 0.5

t0 = time.perf_counter()
for n in range(0, 18000):
    arr = np.arange(0, len(a))
    arr2 = np.arange(0, len(b))

    event_index1 = arr[~a]    # indices where a is 0/False
    event_index2 = arr2[~b]   # indices where b is 0/False
print('this took', round(time.perf_counter()-t0, 3), 's') #this took 0.198 s

This seems to be about 100 times faster.

For the rest that you seem to need in your code example: to get the matching indices you could just sum the two arrays and count the zeros (the sum is 0 only where both arrays have a 0). I'm not sure whether calculating the two relative intersections yields different results for you? It seems to me they are identical. In your final calculation, if you only use n1 and n2 in the end, you do not even need the index arrays for ts_1 and ts_2; you can directly do n1 = np.sum(ts_1 == 0). In fact, maybe you need no indexing at all.
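
A minimal sketch of those two shortcuts, assuming ts_1 and ts_2 are numpy arrays of 0/1 with equal length (the random data below is only illustrative):

import numpy as np

ts_1 = np.random.randint(0, 2, 8000)   # illustrative 0/1 data
ts_2 = np.random.randint(0, 2, 8000)

# counts of zeros, without building index lists
n1 = np.sum(ts_1 == 0)
n2 = np.sum(ts_2 == 0)

# matching positions: the element-wise sum is 0 only where both arrays are 0
n_matching = np.sum((ts_1 + ts_2) == 0)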

Adrien Mau