How to apply my own function along each rows and columns with NumPy

Question

I'm using NumPy to store data into matrices. I'm struggling to make the below Python code perform better. RESULT is the data store I want to put the data into.

TMP = np.array([[1,1,0],[0,0,1],[1,0,0],[0,1,1]])
n_row, n_col = TMP.shape[0], TMP.shape[0]
RESULT = np.zeros((n_row, n_col))

def do_something(array1, array2):
    intersect_num = np.bitwise_and(array1, array2).sum()
    union_num = np.bitwise_or(array1, array2).sum()
    try:
        return intersect_num / float(union_num)
    except ZeroDivisionError:
        return 0

for i in range(n_row):
    for j in range(n_col):
    if i >= j:
        continue
    RESULT[i, j] = do_something(TMP[i], TMP[j])

I guess it would be much faster if I could use some NumPy built-in function instead of for-loops.

I was looking for the various questions around here, but I couldn't find the best fit for my problem. Any suggestion? Thanks in advance!

If you are looking for a generic solution, use the linked question. For a specific function, edit your question. — Divakar, Jul 31 '16 at 11:45
The `duplicate` was probably chosen because of its age and high score. All it suggests is `np.vectorize`. It ignores the fact that you are skipping the upper triangle. Try `for j in range(I,ncol)`. For bettter help tell us more about the function. — hpaulj, Jul 31 '16 at 12:15
`np.vectorize` is useful if you need help broadcasting; otherwise is syntactic sugar. In this case it could be harmful. It first tries to evaluate return type with a test calculation. Also it provides values not indices. So there's no way to test for upper or lower triangle — hpaulj, Jul 31 '16 at 12:20
Also the function takes rows. `vectorize` feeds scalars. That is a **bad duplicate**. — hpaulj, Jul 31 '16 at 12:28
Thanks, hpaulj. I edited the function. Briefly, the function takes two arrays and return jaccard similarity. — nonpara, Jul 31 '16 at 12:39

Divakar · Accepted Answer · 2016-07-31T13:36:34.447

2

Approach #1

You could do something like this as a vectorized solution -

# Store number of rows in TMP as a paramter
N = TMP.shape[0]  

# Get the indices that would be used as row indices to select rows off TMP and 
# also as row,column indices for setting output array. These basically correspond
# to the iterators involved in the loopy implementation
R,C = np.triu_indices(N,1)    

# Calculate intersect_num, union_num and division results across all iterations
I = np.bitwise_and(TMP[R],TMP[C]).sum(-1)
U = np.bitwise_or(TMP[R],TMP[C]).sum(-1)
vals = np.true_divide(I,U)

# Setup output array and assign vals into it
out = np.zeros((N, N))
out[R,C] = vals

Approach #2

For cases with TMP holding 1s and 0s, those np.bitwise_and and np.bitwise_or would be replaceable with dot-products and as such could be faster alternatives. So, with those we would have an implementation like so -

M = TMP.shape[1]   
I = TMP.dot(TMP.T)
TMP_inv = 1-TMP
U = M - TMP_inv.dot(TMP_inv.T)
out = np.triu(np.true_divide(I,U),1)

edited Jul 31 '16 at 13:36

answered Jul 31 '16 at 12:55

Divakar

218,885
19
262
358

@yukimilk Also, would `TMP` always have `1s` and `0s` only? – Divakar Jul 31 '16 at 13:25
Yes. it only has 0 or 1 as type of np.int – nonpara Jul 31 '16 at 13:31
@yukimilk Added another approach using dot-product, check it out too! – Divakar Jul 31 '16 at 13:42
That is awesome. I really appreciate your big help, Divakar! – nonpara Jul 31 '16 at 13:48

How to apply my own function along each rows and columns with NumPy

1 Answers1