
I have a corpus with m documents and n unique words.

Based on this corpus, I want to calculate a co-occurrence matrix for words and calculate their similarity.

To do so, I have created a NumPy array `occurrences` (m × n), which indicates which words are present in each document.

Based on `occurrences`, I have created `cooccurrences` as follows:

cooccurrences = np.transpose(occurrences) @ occurrences

Furthermore, `word_occurrences` gives, for each word, the number of documents it appears in:

word_occurrences = occurrences.sum(axis=0)
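
For concreteness, a small toy setup (the corpus below is made up purely for illustration, with m = 3 documents and n = 4 words):

import numpy as np

# Binary document-word matrix: occurrences[d, w] == 1 iff word w appears
# in document d. Shape (m, n) = (3, 4).
occurrences = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
])

cooccurrences = occurrences.T @ occurrences   # (4, 4) pairwise counts
word_occurrences = occurrences.sum(axis=0)    # array([2, 2, 1, 3])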

Now, I want to calculate similarity scores for the words in `cooccurrences` based on association strength.

I want to divide each cell `i, j` in `cooccurrences` by `word_occurrences[i] * word_occurrences[j]`.

Currently, I loop through `cooccurrences` to do this:

def calculate_association_strength(cooc, i, j, word_occurrences):
    # Association strength: the co-occurrence count normalised by the
    # product of the two words' individual occurrence counts.
    return cooc / (word_occurrences[i] * word_occurrences[j])


for i in range(len(cooccurrences)):
    for j in range(len(cooccurrences)):
        if i != j:
            if cooccurrences[i, j] > 0:
                cooccurrences[i, j] = 1 - calculate_association_strength(
                    cooccurrences[i, j], i, j, word_occurrences
                )
        else:
            cooccurrences[i, j] = 0

But with m > 30 000, this is very time-consuming. Is there a faster way to do this?

Here, they discuss mapping a function over an `np.array`. However, they don't use multiple variables derived from the array.

Emil
  • You may want to take into account counts of words in each document; it will help you differentiate similar vs. very similar documents. Second thing: could you provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)? – dankal444 Dec 23 '21 at 18:10
  • I don't know NumPy, but I can imagine that a NumPy-specific approach would be good. Nevertheless, two observations: `word_occurrences[i]*word_occurrences[j]` can be calculated in advance for each `i, j` combination, as sketched below (EDIT: you should be able to save half the calculations, since `i, j` is the same as `j, i`). This might speed things up a little. Also, the final result for each cell does not depend on the other cell values, so your algorithm can be parallelized easily. – Rolf Jan 04 '22 at 08:53
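
A minimal sketch of that precomputation, assuming `word_occurrences` is the 1-D count vector defined in the question:

import numpy as np

# All pairwise products word_occurrences[i] * word_occurrences[j] at once;
# np.outer builds the full (n, n) product matrix.
pair_products = np.outer(word_occurrences, word_occurrences)

Because the result is symmetric (`pair_products[i, j] == pair_products[j, i]`), only the upper triangle is strictly needed, which is where the halving observation comes from.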

1 Answer


If I understand the problem here correctly, you could vectorise the whole operation, so the result would be:

cooccurrences / (word_occurrences.reshape(-1, 1) * word_occurrences)

This is pretty much guaranteed to run faster than looping over the array in pure Python.
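
For completeness, a sketch of the full vectorised update from the question, assuming the goal is to reproduce the loop exactly (the `1 -` transform, zero cells left at zero, and a zeroed diagonal); the function name here is illustrative:

import numpy as np

def association_similarity(cooccurrences, word_occurrences):
    # Denominator word_occurrences[i] * word_occurrences[j] for every
    # (i, j) pair at once: a column vector broadcast against a row vector.
    # Assumes every word occurs at least once, so there is no zero division.
    denom = word_occurrences.reshape(-1, 1) * word_occurrences
    strength = cooccurrences / denom

    # Apply 1 - strength only where the co-occurrence count is positive,
    # leaving zero cells at zero, then clear the diagonal (the i == j case).
    result = np.where(cooccurrences > 0, 1 - strength, 0)
    np.fill_diagonal(result, 0)
    return result

Comparing this against the original loop with `np.allclose` should confirm equivalence, provided `cooccurrences` is a float array so the loop's in-place assignments are not truncated to integers.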

w_sz
  • Thank you for your response @w_sz. Unfortunately, your suggested approach results in a different matrix than I currently have. – Emil Dec 23 '21 at 12:48
  • I must have messed up the dimensions then. I edited the answer and think that if you reshape the other matrix it's going to work as intended. – w_sz Dec 23 '21 at 13:02
  • Thanks again. Still, this results in a different matrix. – Emil Dec 23 '21 at 13:06
  • It worked for me when I compared them with `np.isclose()`, so perhaps there might be a very small difference. – w_sz Dec 23 '21 at 13:43