Python - calculate the co-occurrence matrix

Question

I'm working on an NLP task and I need to calculate the co-occurrence matrix over documents. The basic formulation is as below:

Here I have a matrix with shape (n, length), where each row represents a sentence composed of length words, so there are n sentences in all, each with the same length. Then, with a defined context size, e.g., window_size = 5, I want to calculate the co-occurrence matrix D, where the entry in the cth row and wth column is #(w,c), the number of times that a context word c appears in w's context.

An example can be found here: How to calculate the co-occurrence between two words in a window of text?

I know it can be calculated with nested loops, but I want to know whether there exists a simpler way or a ready-made function. I have found some answers, but they do not work with a window sliding through the sentence. For example: word-word co-occurrence matrix

So could anyone tell me whether there is a function in Python that deals with this problem concisely? I think this task is quite common in NLP.

score 10 · Accepted Answer · edited Oct 31 '18 at 18:44


It is not that complicated, I think. Why not write the function yourself? First get the matrix X according to this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage Then, for each sentence, count the co-occurrences and add them to a running total.

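import numpy as np

window = 5  # context size on each side, as in the question
# length here is the number of distinct words (vocabulary size), not the sentence length
m = np.zeros([length, length])
def cal_occ(sentence, m):
    # sentence is a sequence of integer word indices in 0..length-1
    for i, word in enumerate(sentence):
        for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
            if j != i:  # skip the central word itself
                m[word, sentence[j]] += 1
for sentence in X:
    cal_occ(sentence, m)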
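
Since the question's input is an (n, length) array of equal-length sentences, the same counts can also be accumulated one offset at a time with NumPy rather than visiting every word in Python. The following is a minimal sketch, assuming the words are already encoded as integer IDs in 0..vocab_size-1; the names cooccurrence_matrix and vocab_size are illustrative and do not come from the original answer.

```python
import numpy as np

def cooccurrence_matrix(X, vocab_size, window_size):
    """Count windowed co-occurrences for an (n, length) array of word IDs.

    D[w, c] is the number of times word c appears within window_size
    positions of word w in the same sentence (the word itself is excluded).
    """
    X = np.asarray(X)
    D = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    length = X.shape[1]
    # An offset larger than length - 1 cannot pair any two positions.
    for offset in range(1, min(window_size, length - 1) + 1):
        centers = X[:, :-offset].ravel()   # words at position i
        contexts = X[:, offset:].ravel()   # words at position i + offset
        # np.add.at accumulates correctly even when index pairs repeat.
        np.add.at(D, (centers, contexts), 1)
        np.add.at(D, (contexts, centers), 1)
    return D

# Toy input (hypothetical): 2 sentences of 3 word IDs each.
X = np.array([[0, 1, 2],
              [1, 1, 0]])
print(cooccurrence_matrix(X, vocab_size=3, window_size=1))
```

Because the window is symmetric, the resulting matrix is symmetric too, so it does not matter whether rows are read as central words or as context words.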

edited Oct 31 '18 at 18:44

Ruzihm


answered Jan 15 '17 at 16:23

Zealseeker


Thanks a lot. I have seen the `CountVectorizer` function, but the occurrence it calculates is not what I want here. In the co-occurrence matrix I mean, each entry represents the co-occurrence of a **central** word `w` and its context word `c`. `CountVectorizer` just counts every pair of words occurring in a sentence. I can only accomplish it with nested loops, and I'm wondering whether a simple API exists. Thanks for your reply :) – GEORGE GUO Jan 16 '17 at 04:35
@GEORGEGUO I'm sorry for the mistaken use of CountVectorizer. But you can directly use the transcribed sentences with the code I gave. In short, it's not complicated and you can make such an "API" yourself. – Zealseeker Jan 16 '17 at 05:32
ya I know. Thanks – GEORGE GUO Jan 16 '17 at 06:11
@Holdwin Thanks for your editing. It's really helpful. I'm embarrassed that there were so many mistakes in my answer. – Zealseeker Nov 01 '18 at 01:28
@Ruzihm what is m[word,sentence[j]]+=1? i am getting slicing index error – Akshay Indalkar Jul 19 '19 at 17:25
@AkshayIndalkar It increments the co-occurrence of the word with index `word` and the word with the index at `sentence[j]` by one. Make sure all of the values in `sentence` are between 0 and `length-1`. If you still need help, [ask a question](https://stackoverflow.com/questions/ask) including your code and [a link back to this answer](https://stackoverflow.com/a/41663359/1092820). – Ruzihm Jul 19 '19 at 17:44
`length = len(sentence) for j in range(max(i - window_size, 0), min(i + window_size+1, length)): if i == j: continue m[word, sentence[j]] += 1` – Andrew Matiuk Oct 04 '20 at 17:20
It is a partial code where, to each sentence, *m* is a matrix of sentence length x length, *word* is in the sentence, *sentence* is a list of tokenized sentence, and *sentence[j]* another word in the sentence. – Mello May 03 '22 at 21:41
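
The indexing error raised in the comments typically appears when the sentences still hold strings rather than integer indices. A minimal sketch of the missing step follows, assuming tokenized sentences as lists of strings; the toy data and the names vocab and encoded are hypothetical, not part of the answer.

```python
import numpy as np

# Hypothetical toy data: sentences that are already tokenized.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Give each distinct token the next free integer ID.
vocab = {}
for sentence in sentences:
    for token in sentence:
        vocab.setdefault(token, len(vocab))

# Replace every token by its ID so it can index the matrix.
encoded = [[vocab[token] for token in sentence] for sentence in sentences]

window = 1
m = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in encoded:
    for i, word in enumerate(sentence):
        lo = max(i - window, 0)
        hi = min(i + window + 1, len(sentence))
        for j in range(lo, hi):
            if j != i:  # do not count a word as its own context
                m[word, sentence[j]] += 1

print(vocab)
print(m)
```

With this encoding every value used as an index lies between 0 and len(vocab) - 1, which is the condition the comment above asks for.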

score 0 · Answer 2 · answered Sep 22 '19 at 12:22

I have calculated the co-occurrence matrix with window size 2.

First, write a function that returns the correct neighbouring words (here I have used GetContext).
Then create the matrix and add 1 whenever a given word is present in the neighbourhood.

Here is the python code:

import numpy as np

CORPUS = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]

# Words to track; alternatively: list(set((' '.join(ctxs)).split(' ')))
top2000 = ["abc", "pqr", "def"]
a = np.zeros((len(top2000), len(top2000)), np.int32)
for sentence in CORPUS:
    for index, word in enumerate(sentence.split(' ')):
        if word in top2000:
            print(word)
            # GetContext must be defined before this loop runs
            context = GetContext(sentence, index)
            print(context)
            for word2 in context:
                if word2 in top2000:
                    a[top2000.index(word)][top2000.index(word2)] += 1
print(a)

The GetContext function (it must be defined before the loop above runs):

def GetContext(sentence, index):
    # Return up to two words on each side of position index (window size 2),
    # excluding the word at index itself. Slicing clamps at the sentence
    # boundaries, so no special cases are needed for the first or last words.
    words = sentence.split(' ')
    left = words[max(0, index - 2):index]
    right = words[index + 1:index + 3]
    return left + right

Here is the result:

array([[0, 3, 3],
       [3, 0, 2],
       [3, 2, 0]])
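
The same idea extends to any window size and to the full vocabulary without fixing the matrix size in advance. Here is a minimal sketch that counts pairs in a collections.Counter; the helper names get_context and count_pairs and the window parameter are illustrative and not part of the answer above.

```python
from collections import Counter

def get_context(words, index, window=2):
    """Return up to `window` words on each side of words[index]."""
    left = words[max(0, index - window):index]
    right = words[index + 1:index + 1 + window]
    return left + right

def count_pairs(corpus, window=2):
    """Map (central word, context word) to its windowed co-occurrence count."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for index, word in enumerate(words):
            for context_word in get_context(words, index, window):
                counts[(word, context_word)] += 1
    return counts

corpus = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
counts = count_pairs(corpus, window=2)
print(counts[("abc", "pqr")], counts[("abc", "def")], counts[("pqr", "def")])
```

For the three tracked words this reproduces the counts in the matrix above (3, 3 and 2 for the pairs printed), while pairs involving the other words are counted as well.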
