I have a corpus with m documents and n unique words.
Based on this corpus, I want to build a word co-occurrence matrix and calculate similarity scores between words.
To do so, I have created a NumPy array occurrences (m x n), which indicates which words are present in each document. Based on occurrences, I have created cooccurrences as follows:
cooccurrences = np.transpose(occurrences) @ occurrences
Furthermore, word_occurrences gives the sum per word over the corpus:
word_occurrences = occurrences.sum(axis=0)
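For a concrete illustration of these two steps, here is a made-up toy corpus (m = 3 documents, n = 4 words; the values are purely hypothetical):

import numpy as np

# Hypothetical binary document-term matrix: 3 documents x 4 unique words.
occurrences = np.array([
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
])

# cooccurrences[i, j] counts the documents containing both word i and word j.
cooccurrences = np.transpose(occurrences) @ occurrences

# word_occurrences[i] counts the documents containing word i.
word_occurrences = occurrences.sum(axis=0)

# e.g. cooccurrences[0, 3] == 2: words 0 and 3 appear together in two documents.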
Now, I want to calculate the similarity scores of the words in cooccurrences based on the association strength: I want to divide each cell (i, j) in cooccurrences by word_occurrences[i] * word_occurrences[j].
Currently, I loop through cooccurrences to do this:
def calculate_association_strength(cooc, i, j, word_occurrences):
    # Association strength of the word pair (i, j).
    return cooc / (word_occurrences[i] * word_occurrences[j])

for i in range(len(cooccurrences)):
    for j in range(len(cooccurrences)):
        if i != j:
            if cooccurrences[i, j] > 0:
                cooccurrences[i, j] = 1 - calculate_association_strength(cooccurrences[i, j], i, j, word_occurrences)
            else:
                cooccurrences[i, j] = 0
But with m > 30 000, this is very time-consuming. Is there a faster way to do this?
Here, they discuss mapping a function over an np.array. However, they don't use multiple variables derived from the array, as I need to here.
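What I am hoping for is something along the lines of the following broadcasting sketch (untested, with variable names of my own invention; it assumes every word occurs at least once, so the denominator is never zero):

# Denominator for every pair at once: word_occurrences[i] * word_occurrences[j].
denom = np.outer(word_occurrences, word_occurrences)

# Association strength for all pairs in one shot.
strength = cooccurrences / denom

# Mirror the loop: 1 - strength where a co-occurrence exists, 0 otherwise.
similarity = np.where(cooccurrences > 0, 1 - strength, 0)

# The loop skips i == j, leaving the diagonal untouched, so restore it.
np.fill_diagonal(similarity, np.diag(cooccurrences))

Would this be equivalent to the loop above, and is it the idiomatic way to do it?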