
I have a corpus with m documents and n unique words.

Based on this corpus, I want to calculate a co-occurrence matrix for words and calculate their similarity.

To do so, I have created a NumPy array `occurrences` (m × n), which indicates which words are present in each document.

Based on `occurrences`, I have created `cooccurrences` as follows:

cooccurrences = np.transpose(occurrences) @ occurrences

Furthermore, `word_occurrences` gives, for each word, the number of documents it appears in:

word_occurrences = occurrences.sum(axis=0)
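
For concreteness, a small toy setup (the corpus below is made up purely for illustration, with m = 3 documents and n = 4 words):

import numpy as np

# Binary document-word matrix: occurrences[d, w] == 1 iff word w appears
# in document d. Shape (m, n) = (3, 4).
occurrences = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
])

cooccurrences = occurrences.T @ occurrences   # (4, 4) pairwise counts
word_occurrences = occurrences.sum(axis=0)    # array([2, 2, 1, 3])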

Now, I want to calculate similarity scores for the words in `cooccurrences` based on association strength.

I want to divide each cell `i, j` in `cooccurrences` by `word_occurrences[i] * word_occurrences[j]`.

Currently, I loop through `cooccurrences` to do this:

def calculate_association_strength(cooc, i, j, word_occurrences):
    # Association strength: the co-occurrence count normalised by the
    # product of the two words' individual occurrence counts.
    return cooc / (word_occurrences[i] * word_occurrences[j])


for i in range(len(cooccurrences)):
    for j in range(len(cooccurrences)):
        if i != j:
            if cooccurrences[i, j] > 0:
                cooccurrences[i, j] = 1 - calculate_association_strength(
                    cooccurrences[i, j], i, j, word_occurrences
                )
        else:
            cooccurrences[i, j] = 0

But with m > 30 000, this is very time-consuming. Is there a faster way to do this?

Here, they discuss mapping a function over an `np.array`. However, they don't use multiple variables derived from the array.

Emil
  • You may want to take into account counts of words in each document; it will help you differentiate similar vs. very similar documents. Second thing: could you provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)? – dankal444 Dec 23 '21 at 18:10
  • I don't know NumPy, but I can imagine that a NumPy-specific approach would be good. Nevertheless, two observations: `word_occurrences[i]*word_occurrences[j]` can be calculated in advance for each `i, j` combination, as sketched below (EDIT: you should be able to save half the calculations, since `i, j` is the same as `j, i`). This might speed things up a little. Also, the final result for each cell does not depend on the other cell values, so your algorithm can be parallelized easily. – Rolf Jan 04 '22 at 08:53
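
A minimal sketch of that precomputation, assuming `word_occurrences` is the 1-D count vector defined in the question:

import numpy as np

# All pairwise products word_occurrences[i] * word_occurrences[j] at once;
# np.outer builds the full (n, n) product matrix.
pair_products = np.outer(word_occurrences, word_occurrences)

Because the result is symmetric (`pair_products[i, j] == pair_products[j, i]`), only the upper triangle is strictly needed, which is where the halving observation comes from.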

1 Answer


If I understand the problem here correctly, you could vectorise the whole operation, so the result would be:

cooccurrences / (word_occurrences.reshape(-1, 1) * word_occurrences)

This is pretty much guaranteed to run faster than looping over the array in pure Python.
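
For completeness, a sketch of the full vectorised update from the question, assuming the goal is to reproduce the loop exactly (the `1 -` transform, zero cells left at zero, and a zeroed diagonal); the function name here is illustrative:

import numpy as np

def association_similarity(cooccurrences, word_occurrences):
    # Denominator word_occurrences[i] * word_occurrences[j] for every
    # (i, j) pair at once: a column vector broadcast against a row vector.
    # Assumes every word occurs at least once, so there is no zero division.
    denom = word_occurrences.reshape(-1, 1) * word_occurrences
    strength = cooccurrences / denom

    # Apply 1 - strength only where the co-occurrence count is positive,
    # leaving zero cells at zero, then clear the diagonal (the i == j case).
    result = np.where(cooccurrences > 0, 1 - strength, 0)
    np.fill_diagonal(result, 0)
    return result

Comparing this against the original loop with `np.allclose` should confirm equivalence, provided `cooccurrences` is a float array so the loop's in-place assignments are not truncated to integers.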

w_sz
  • Thank you for your response @w_sz. Unfortunately, your suggested approach results in a different matrix than I currently have. – Emil Dec 23 '21 at 12:48
  • I must have messed up the dimensions then. I edited the answer and think that if you reshape the other matrix it's going to work as intended. – w_sz Dec 23 '21 at 13:02
  • Thanks again. Still, this results in a different matrix. – Emil Dec 23 '21 at 13:06
  • It worked for me when I compared them with `np.isclose()`, so perhaps there might be a very small difference. – w_sz Dec 23 '21 at 13:43